replacing ugly for loops

replacing ugly for loops

andrewH
I have a couple of hundred American Community Survey Summary File files containing rectangular arrays of data, mainly though not exclusively numeric.  Each file is referred to as a sequence (henceforth "seq").  From these files I am trying to extract particular subsets (tables), each consisting of a set of columns.  Each table is defined by three numbers (now columns in a data frame):
1. a file identifier (seq)
2. the position of the table's first column (startNo)
3. the length of the table, in columns (len)
so the columns to select for one triple would consist of startNo:(startNo+len-1).   For each sequence, I am trying to create a vector of all the column numbers for the tables in that sequence.

Obviously I could do this with nested for loops, e.g.:
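
As a minimal illustration (not part of the original post), the column numbers for a single (startNo, len) pair can be generated with the colon operator or, equivalently, by adding offsets from seq_len(). The values 3 and 4 are taken from the first row of the example data in this post; the names cols_colon and cols_offset are made up for the sketch.

```r
# One table: starts at column 3 and is 4 columns wide.
startNo <- 3
len <- 4

cols_colon  <- startNo:(startNo + len - 1)   # colon form used in the post
cols_offset <- startNo + seq_len(len) - 1    # same columns; yields an empty vector when len is 0

print(cols_colon)
```

The offset form is a little more defensive: with len equal to 0 the colon form would count down (3:2) rather than return nothing.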

> seq <- c(1,1,2,2)
> startNo  <- c(3, 10, 3, 15)
> len <- c(4, 2, 5, 3)
> data.df <- data.frame(seq, startNo, len)
>
> seq.f <- factor(data.df$seq)
> data.l <- split(data.df, seq.f)
> selectColsList<- vector("list", length(levels(seq.f)))
> for (i in seq_along(levels(seq.f))){
+     selectCols <- numeric()
+     for (j in seq_along(data.l[[i]]$startNo)){
+         selectCols <- c(selectCols, data.l[[i]]$startNo[j]:(data.l[[i]]$startNo[j] +
+                         data.l[[i]]$len[j]-1))
+     }
+     selectColsList[[i]] <- selectCols
+ }
> selectColsList
[[1]]
[1]  3  4  5  6 10 11
[[2]]
[1]  3  4  5  6  7 15 16 17

But this code strikes me as inelegant and verbose. It seems to me that there ought to be a way to make the outer loop, (indexed with i) into a tapply function (which is why I started with a split()), and the inner loop (indexed with j) into some cute recursive function, but I was not able to do so. If anyone could suggest some nicer (e.g. shorter, or faster, or just more sophisticated) way to do this instead, I would be most grateful.

Sincerely, andrewH
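
One possible way to express the two loops above without explicit for statements is sketched here. It is only an illustration under assumptions, not a method from the thread: the file id column is renamed to seqId (a hypothetical name) so that it does not share a name with the base function seq, and the outer loop becomes lapply() over split() while the inner loop becomes Map().

```r
# Example data from the post; the id column is called seqId here
# (a hypothetical name) to avoid clashing with base::seq.
data.df <- data.frame(seqId   = c(1, 1, 2, 2),
                      startNo = c(3, 10, 3, 15),
                      len     = c(4, 2, 5, 3))

# Outer loop: split() gives one data frame per file id, lapply() visits each.
# Inner loop: Map() builds one column range per row; unlist() joins them.
selectColsList <- lapply(split(data.df, data.df$seqId), function(d) {
  unlist(Map(function(s, n) s + seq_len(n) - 1, d$startNo, d$len))
})

print(selectColsList)
```

The result is a list named by file id, with the column numbers in the order the tables appear in the data frame.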

Re: replacing ugly for loops

Bert Gunter
I am not sure you have expressed what you want to do correctly. See inline:

On Wed, Oct 10, 2012 at 9:10 PM, andrewH <[hidden email]> wrote:
> I have a couple of hundred American Community Survey Summary Files files
> containing rectangular arrays of data, mainly though not exclusively
> numeric.  Each file is referred to as a sequence (henceforth "seq").
-- so 1 "seq" (terrible identifier -- see below for why) = 1 file

> From these files I am trying to extract particular subsets (tables) consisting of
> a sets of columns.  These tables are defined by three numbers (now in
> columns in a data frame):
> 1.      a file identifier (seq)
> 2.      first column position numbers (startNo)
> 3.      length of table (len)

So your data frame, call it yourframe, has columns named:

seq      startNo       len


> so the columns to select for one triple would consist of
> startNo:(startNo+length-1).   I am trying to create for each sequence a
> vector of all the column numbers for tables in that sequence.

So for each seq id you want to find all the column numbers, right?

sq.n <- seq_len(nrow(yourframe)) ## Just to make it easier to read
colms <-  tapply(sq.n, yourframe$seq,function(x) with(yourframe[x,],
   sort(unique(do.call(c, mapply(seq, from=startNo,
length=len,SIMPLIFY = FALSE)))))

## Comments
In the mapply call, seq is the R function, ?seq.  That's why using it
as a name for a file id is terrible -- it causes confusion.
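
A small sketch of the name clash, added for illustration. When a name is used in call position, R looks for a function and skips objects of the same name that are not functions, so a vector called seq does not stop seq(...) from working; the concern is mainly readability. The names cols and cols_explicit are made up for the example.

```r
seq <- c(1, 1, 2, 2)   # a data vector that shares its name with base::seq

# In call position R searches for a function, so this still calls base::seq.
cols <- seq(from = 3, length.out = 4)
print(cols)

# Spelling out the namespace removes any doubt for a human reader.
cols_explicit <- base::seq(from = 3, length.out = 4)
```

Renaming the column, or writing base::seq, would be two ways to make the intent obvious.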

In the absence of data, this is untested -- and probably not quite
right. But it should be close, I hope. The key idea is the use of
mapply to get the sequence of columns for each row in all the rows for
each seq id. The SIMPLIFY = FALSE guarantees that this yields a list
of vectors of column indices, which are then glopped together and
cleaned up by the sort(unique(do.call(  ...  stuff.
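
The steps in that paragraph can be unpacked one at a time. The sketch below uses the two rows for file id 1 from the example data and writes base::seq and length.out in full for clarity (the code above writes length=, which R partially matches to length.out). The intermediate names ranges, flat and cols are made up for the illustration.

```r
# Rows for file id 1 in the example data: two tables.
startNo <- c(3, 10)
len     <- c(4, 2)

# Step 1: one vector of column numbers per row, kept as a list.
# Without SIMPLIFY = FALSE, equal-length results would be collapsed into a matrix.
ranges <- mapply(base::seq, from = startNo, length.out = len, SIMPLIFY = FALSE)
str(ranges)

# Step 2: concatenate the list elements into a single vector.
flat <- do.call(c, ranges)

# Step 3: drop duplicates (from overlapping tables) and sort.
cols <- sort(unique(flat))
print(cols)
```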

colms should then be a list giving the sorted column numbers to choose
for each "seq" id.

I do not know whether this (once cleaned up) is either more elegant
or more efficient than what you proposed. And I wouldn't be surprised
if someone like Bill Dunlap comes up with a much better way, either.
But it is different -- and perhaps amusing.

... If I have properly understood what you wanted. If not, ignore all.

Cheers,
Bert




--

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: replacing ugly for loops

Bert Gunter
Sorry, you **did** supply data and my solution **does** work (except that I
left off one closing ")").

> sq.n <- seq_len(nrow(data.df))
> tapply(sq.n,data.df$seq,function(x)with(data.df[x,],
+ sort(unique(do.call(c,mapply(seq,from=startNo,length=len,SIMPLIFY=FALSE))))))
$`1`
[1]  3  4  5  6 10 11

$`2`
[1]  3  4  5  6  7 15 16 17

Cheers,
Bert



Re: replacing ugly for loops

andrewH
Dear Bert--
I tried your function on the data that I provided (data.df) and it worked beautifully (after I added a missing final parenthesis), producing exactly the same output as my function.  This is an excellent example of what I was looking for, because it is
   (a) 50% shorter than mine,
   (b) fully vectorized, and
   (c) uses three functions that I have never used before: with, unique, and do.call

I am going to spend a happy afternoon working through this command by command, and at the end I am confident that I will have learned some valuable new (to me) tricks.
Thanks!
Warmest Regards, AndrewH

Re: replacing ugly for loops

Bert Gunter
I hate to decline such praise, but honesty demands that I must.

In fact, my solution is **not** fully vectorized at all! The tapply()
and mapply() calls are, in fact, in some sense hidden loops at the
interpreted level. They do have the virtue of being true to R's
functional paradigm, but they are loops, nevertheless. For this
reason, they may not be more efficient than the explicit loops you've
written. But I hope the code is more transparent.
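
For contrast, here is a sketch of one way to avoid apply-family calls and explicit loops at the R level. It is an illustration, not something proposed in the thread. It assumes the id column is renamed to seqId (a hypothetical name), and unlike the tapply() version it does not sort the result or remove duplicates.

```r
# Example data from the thread; seqId is a hypothetical rename of seq.
data.df <- data.frame(seqId   = c(1, 1, 2, 2),
                      startNo = c(3, 10, 3, 15),
                      len     = c(4, 2, 5, 3))

# Repeat each start position len times, then add the within-table
# offsets 0, 1, ..., len - 1 produced by sequence().
cols <- rep(data.df$startNo, data.df$len) + sequence(data.df$len) - 1

# Label every column number with its file id and split once.
colms <- split(cols, rep(data.df$seqId, data.df$len))
print(colms)
```

Whether this is actually faster than the loops would need to be timed on real data; if tables within a file can overlap, a sort(unique(...)) step per group would still be needed.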

And I did send a follow-up note to the list both acknowledging my
erroneous accusation that you did not provide data and confirming that
my proposed solution worked with the example you did, in fact,
provide.

But thanks for the kind words anyway.

-- Bert
