apply with multiple conditions

pguilha
Hello all,

I have written a for loop that acts on a data frame with close to 3 million rows and 6 columns, and I would like to pass it to apply() to speed the process up (I let the loop run for 2 days before stopping it, and it had only gone through 200,000 rows), but I am really struggling to find a way to pass the arguments. Below are the loop and the head of the data frame I am working on.
Any hints would be much appreciated, thank you! (I have searched for this but could not find any other posts doing quite what I want.)
Paul

x<-as.numeric(all.tf7[1,2])
for (i in 2:nrow(all.tf7)) {
  if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)<115341) all.tf7[i,6]<-all.tf7[i-1,6]
  else if (all.tf7[i,1]==all.tf7[i-1,1] & (all.tf7[i,2]-x)>=115341) {
    all.tf7[i,6]<-(all.tf7[i-1,6]+1)
    x<-as.numeric(all.tf7[i,2]) }
  else if (all.tf7[i,1]!=all.tf7[i-1,1])  {
    all.tf7[i,6]<-(all.tf7[i-1,6]+1)
    x<-as.numeric(all.tf7[i,2]) }
}

#the aim here is to attribute a bin number to each row so that I can then split the dataframe according to those bins.


chrom  chromStart  chromEnd  name          cumsum  bin
chr1        10089     10309  ZBTB33         10089    1
chr1        10132     10536  TAF7_(SQ-8)    20221    1
chr1        10133     10362  Pol2-4H8       30354    1
chr1        10148     10418  MafF_(M8194)   40502    1
chr1        10382     10578  ZBTB33         50884    1
chr1        16132     16352  CTCF           67016    1

Re: apply with multiple conditions

Adams, Jean
Paul,

My interpretation is that you are trying to assign a new bin number to a
row every time the variable chrom changes and every time the variable
chromStart changes by 115341 or more.  Is that right?  If so, you don't
need a loop at all.  Check out the code below.  I made a couple changes to
the all.tf7 example data frame so that it would have two changes in bin
number, one based on the chrom variable and one based on the chromStart
variable.

Jean

all.tf7 <- data.frame(
        chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
        chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
        chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
        name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
                "ZBTB33", "CTCF"),
        cumsum = c(10089, 20221, 30354, 40502, 50884, 67016),
        bin = rep(NA, 6)
        )

# assign a new bin every time chrom changes and every time chromStart
# changes by 115341 or more
L <- nrow(all.tf7)
prev.chrom <- c(NA, all.tf7$chrom[-L])
delta.start <- c(NA, all.tf7$chromStart[-1] - all.tf7$chromStart[-L])
new.bin <- is.na(prev.chrom) | all.tf7$chrom != prev.chrom |
        delta.start >= 115341
all.tf7$bin <- cumsum(new.bin)
all.tf7


pguilha <[hidden email]> wrote on 07/02/2012 06:25:13 AM:

> Hello all, [...]

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: apply with multiple conditions

pguilha
Thanks for your reply Jean,

I think your interpretation is correct but when I run your code I end
up with the below dataframe and obviously the bins created there don't
correspond to a chromStart change of 115341:

  chrom chromStart chromEnd         name cumsum bin
1  chr1      10089    10309       ZBTB33  10089   1
2  chr1      10132    10536  TAF7_(SQ-8)  20221   2
3  chr2      10133    10362     Pol2-4H8  30354   3
4  chr2      10148    10418 MafF_(M8194)  40502   4
5  chr2     210382   210578       ZBTB33  50884   5
6  chr2     216132   216352         CTCF  67016   6

the first two rows should have the same bin number (same chrom,
<115341 diff), then rows 3&4 should be in another bin (different chrom
from rows 1&2, <115341 diff), and rows 5&6 in another one (same chrom
but >115341 difference between row 4 and row 5).

it seems the new.bin line of your code isn't quite doing what it
should but I can't pinpoint the error there...
Paul
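
One possible explanation, which the thread itself does not confirm: if chrom is stored as a factor (the default for character columns in data.frame() before R 4.0.0), c(NA, chrom[-L]) yields integer level codes rather than labels, and comparing the factor against those codes flags every row as a new bin. The sketch below reproduces the 1 to 6 pattern under that assumption and shows that comparing character values avoids it. The names prev.codes, flag.codes, chr and prev.chr are illustrative only.

```r
# Assumption: chrom is a factor (e.g. created with stringsAsFactors = TRUE)
chrom <- factor(c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"))
chromStart <- c(10089, 10132, 10133, 10148, 210382, 216132)
L <- length(chrom)

# c() starting with NA drops the factor class, leaving integer level codes
prev.codes <- c(NA, chrom[-L])

# factor labels ("chr1") never equal the codes (1), so every row is flagged
flag.codes <- is.na(prev.codes) | chrom != prev.codes
cumsum(flag.codes)   # 1 2 3 4 5 6

# comparing character values gives the intended grouping
chr <- as.character(chrom)
prev.chr <- c(NA, chr[-L])
delta.start <- c(NA, diff(chromStart))
new.bin <- is.na(prev.chr) | chr != prev.chr | delta.start >= 115341
cumsum(new.bin)      # 1 1 2 2 3 3
```

Checking class(all.tf7$chrom) before running the binning code would show whether this applies to a given session.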


On 2 July 2012 14:19, Jean V Adams <[hidden email]> wrote:

> Paul, [...]

Re: apply with multiple conditions

Adams, Jean
Paul,

Are you submitting the exact code that I included in my previous e-mail?
When I submit that code, I get this ...

  chrom chromStart chromEnd         name cumsum bin
1  chr1      10089    10309       ZBTB33  10089   1
2  chr1      10132    10536  TAF7_(SQ-8)  20221   1
3  chr2      10133    10362     Pol2-4H8  30354   2
4  chr2      10148    10418 MafF_(M8194)  40502   2
5  chr2     210382   210578       ZBTB33  50884   3
6  chr2     216132   216352         CTCF  67016   3

Jean
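
With a bin column like the one shown above, the splitting step that Paul names as the end goal can be done with split(). This is a minimal sketch on a cut-down copy of the example data; the object names df and pieces are illustrative, and the bin values are typed in by hand to match the output above.

```r
# Cut-down copy of the example data, with the bin values from the output above
df <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
  bin = c(1, 1, 2, 2, 3, 3)
)

# split() returns a named list with one data frame per distinct bin value
pieces <- split(df, df$bin)
names(pieces)
pieces[["3"]]   # the rows assigned to bin 3
```

Each element of pieces can then be processed separately, for example with lapply().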


Paul Guilhamon <[hidden email]> wrote on 07/02/2012 08:59:00 AM:

> Thanks for your reply Jean, [...]

Re: apply with multiple conditions

pguilha
Jean, that's exactly what it should be, but yes I copied and pasted
from your email so I don't see how I could have introduced an error in
there....
paul

On 2 July 2012 15:57, Jean V Adams [via R]
<[hidden email]> wrote:

> Paul, [...]

Re: apply with multiple conditions

Adams, Jean
Paul,

Try this (I changed some of the object names, but the meat of the code is
the same):

df <- data.frame(
        chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
        chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
        chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
        name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
                "ZBTB33", "CTCF"),
        cumsum = c(10089, 20221, 30354, 40502, 50884, 67016)
        )

# assign a new bin every time chrom changes and every time chromStart
# changes by 115341 or more
L <- nrow(df)
prev.chrom <- c(NA, df$chrom[-L])
delta.start <- c(NA, df$chromStart[-1] - df$chromStart[-L])
new.bin <- is.na(prev.chrom) | df$chrom != prev.chrom |
        delta.start >= 115341
df$bin <- cumsum(new.bin)
df


pguilha <[hidden email]> wrote on 07/02/2012 10:23:36 AM:

> Jean, that's exactly what it should be, [...]

Re: apply with multiple conditions

pguilha
Jean,
It's crazy, I'm still getting 1,2,3,4,5,6 in the bin column.
Also (this is an unrelated problem, I think), unless I've misunderstood
it, your code will only create a new bin if the difference between
chromStart at positions i and i-1 is >=115341. What I want is for a new
bin to be created each time the difference between chromStart at i and
i-j is >=115341, where 'i-j' corresponds to the first row of the last
bin. I'm not sure if I'm being clear: chromStart values correspond to
coordinates along a chromosome, so I basically want to cut up each
chromosome into sections/bins of approximately 115341.

thanks again for all your efforts with this, they're much appreciated!
Paul
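
The rule described here (start a new bin when chrom changes, or when chromStart is at least 115341 beyond the first row of the current bin) depends on where the previous bin started, so a simple lagged difference cannot express it. A minimal sketch, assuming the rows are already sorted by chrom and then chromStart: keep the sequential logic of the original loop, but run it over plain vectors instead of indexing and assigning into the data frame on every iteration, which is typically where most of the time goes. The function name assign_bins and its width argument are hypothetical, not from the thread.

```r
# Hypothetical helper: bin rows by distance from the first row of the
# current bin, restarting whenever chrom changes.
# Assumes the input is sorted by chrom and then by chromStart.
assign_bins <- function(chrom, start, width = 115341) {
  chrom <- as.character(chrom)   # avoid factor-vs-integer comparisons
  start <- as.numeric(start)
  n <- length(start)
  bin <- integer(n)
  if (n == 0L) return(bin)
  bin[1] <- 1L
  anchor <- start[1]             # chromStart of the first row in the current bin
  for (i in seq_len(n)[-1]) {
    if (chrom[i] != chrom[i - 1] || start[i] - anchor >= width) {
      bin[i] <- bin[i - 1] + 1L  # open a new bin and move the anchor
      anchor <- start[i]
    } else {
      bin[i] <- bin[i - 1]       # stay in the current bin
    }
  }
  bin
}

df <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132)
)
df$bin <- assign_bins(df$chrom, df$chromStart)
df$bin   # 1 1 2 2 3 3
```

The difference from a row-to-row comparison shows up with starts such as 0, 100000, 200000 on one chromosome: consecutive gaps never reach 115341, but the third row is 200000 past the anchor, so assign_bins() puts it in a second bin.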

On 2 July 2012 19:36, Jean V Adams [via R]
<[hidden email]> wrote:

> Paul,
>
> Try this (I changed some of the object names, but the meat of the code is
> the same):
>
> df <- data.frame(
>         chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
>         chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
>         chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352),
>         name = c("ZBTB33", "TAF7_(SQ-8)", "Pol2-4H8", "MafF_(M8194)",
> "ZBTB33", "CTCF"),
>         cumsum = c(10089, 20221, 30354, 40502, 50884, 67016)
>         )
>
> # assign a new bin every time chrom changes and every time chromStart
> changes by 115341 or more
> L <- nrow(df)
> prev.chrom <- c(NA, df$chrom[-L])
> delta.start <- c(NA, df$chromStart[-1] - df$chromStart[-L])
> new.bin <- is.na(prev.chrom) | df$chrom != prev.chrom | delta.start >=
> 115341
> df$bin <- cumsum(new.bin)
> df
>
>
> pguilha <[hidden email]> wrote on 07/02/2012 10:23:36 AM:
>
>> Jean, that's exactly what it should be, but yes I copied and pasted
>> from your email so I don't see how I could have introduced an error in
>> there....
>> paul
>>
>> On 2 July 2012 15:57, Jean V Adams [via R]
>> <[hidden email]> wrote:
>> > Paul,
>> >
>> > Are you submitting the exact code that I included in my previous
> e-mail?
>
>> > When I submit that code, I get this ...
>> >
>> >   chrom chromStart chromEnd         name cumsum bin
>> > 1  chr1      10089    10309       ZBTB33  10089   1
>> > 2  chr1      10132    10536  TAF7_(SQ-8)  20221   1
>> > 3  chr2      10133    10362     Pol2-4H8  30354   2
>> > 4  chr2      10148    10418 MafF_(M8194)  40502   2
>> > 5  chr2     210382   210578       ZBTB33  50884   3
>> > 6  chr2     216132   216352         CTCF  67016   3
>> >
>> > Jean
>> >
>> >
>> > Paul Guilhamon <[hidden email]> wrote on 07/02/2012 08:59:00 AM:
>> >
>> >> Thanks for your reply Jean,
>> >>
>> >> I think your interpretation is correct but when I run your code I end
>> >> up with the below dataframe and obviously the bins created there
> don't
>
>> >> correspond to a chromStart change of 115341:
>> >>
>> >>   chrom chromStart chromEnd         name cumsum bin
>> >> 1  chr1      10089    10309       ZBTB33  10089   1
>> >> 2  chr1      10132    10536  TAF7_(SQ-8)  20221   2
>> >> 3  chr2      10133    10362     Pol2-4H8  30354   3
>> >> 4  chr2      10148    10418 MafF_(M8194)  40502   4
>> >> 5  chr2     210382   210578       ZBTB33  50884   5
>> >> 6  chr2     216132   216352         CTCF  67016   6
>> >>
>> >> the first two rows should have the same bin number (same chrom,
>> >> <115341 diff), then rows 3&4 should be in another bin (different
> chrom
>
>> >> from rows 1&2, <115341 diff), and rows 5&6 in another one (same chrom
>> >> but >115341 difference between row 4 and row 5).
>> >>
>> >> it seems the new.bin line of your code isn't quite doing what it
>> >> should but I can't pinpoint the error there...
>> >> Paul

Re: apply with multiple conditions

Rui Barradas
Hello,

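Sorry to intrude, but I think it's a factor issue: when chrom is a factor,
c(NA, df$chrom[-L]) keeps only its integer codes, so df$chrom != prev.chrom
compares labels with numbers. Try changing the disjunction to the following
(split over several lines):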


new.bin <- is.na(prev.chrom) |
                df$chrom != levels(df$chrom)[prev.chrom] |
                delta.start >= 115341

It should work now.
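
To see why the original comparison fails on a factor column, here is a small sketch using the example data from earlier in the thread (only the two columns that matter). Setting `stringsAsFactors = TRUE` explicitly is an assumption made to reproduce Paul's setup, and the names `bad` and `good` are illustrative only.

```r
# Example data from the thread; stringsAsFactors = TRUE is set explicitly
# (an assumption) so that chrom is a factor, as it apparently was for Paul.
df <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
  stringsAsFactors = TRUE
)
L <- nrow(df)

# c(NA, <factor>) drops the factor class and keeps the integer codes
prev.chrom <- c(NA, df$chrom[-L])
delta.start <- c(NA, diff(df$chromStart))

# Comparing labels ("chr1") with codes (1) is TRUE on every row
bad <- cumsum(is.na(prev.chrom) | df$chrom != prev.chrom |
              delta.start >= 115341)

# Mapping the codes back to labels first restores the intended comparison
good <- cumsum(is.na(prev.chrom) |
               df$chrom != levels(df$chrom)[prev.chrom] |
               delta.start >= 115341)

print(bad)   # 1 2 3 4 5 6
print(good)  # 1 1 2 2 3 3
```

The first result matches the 1,2,3,4,5,6 that Paul reported, and the second matches the output Jean saw with character columns.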

Hope this helps,

Rui Barradas

On 02-07-2012 20:03, pguilha wrote:

> Jean,
> It's crazy, I'm still getting 1,2,3,4,5,6 in the bin column.
> Also (this is an unrelated problem, I think), unless I've misunderstood
> it, I think your code will only create a new bin if the difference
> between chromStart at positions i and i-1 is >=115341. What I want is
> for a new bin to be created each time the difference between
> chromStart at i and i-j is >=115341, where 'i-j' corresponds to the
> first row of the last bin. I'm not sure if I'm being
> clear: chromStart values correspond to coordinates along a chromosome,
> so I basically want to cut up each chromosome into sections/bins of
> approximately 115341.
>
> thanks again for all your efforts with this, they're much appreciated!
> Paul
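
The rule Paul describes here (start a new bin when chrom changes, or when chromStart is at least 115341 past the first row of the current bin) depends on where the previous bin began, so it is inherently sequential. One option, sketched below as an assumption rather than something proposed in the thread, is to keep the loop but run it over plain vectors instead of indexing the data frame on every iteration, which is typically where most of the time goes. The function name `assign.bins` and its arguments are hypothetical.

```r
# Hypothetical helper: same rule as the original loop, but on plain vectors
assign.bins <- function(chrom, start, width = 115341) {
  chrom <- as.character(chrom)  # works for factor or character input
  n <- length(start)
  bin <- integer(n)
  if (n == 0L) return(bin)
  bin[1] <- 1L
  anchor <- start[1]  # chromStart of the first row in the current bin
  for (i in seq_len(n)[-1]) {
    if (chrom[i] != chrom[i - 1] || start[i] - anchor >= width) {
      # new chromosome, or far enough from the anchor: open a new bin
      bin[i] <- bin[i - 1] + 1L
      anchor <- start[i]
    } else {
      bin[i] <- bin[i - 1]
    }
  }
  bin
}

bins <- assign.bins(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  start = c(10089, 10132, 10133, 10148, 210382, 216132)
)
print(bins)  # 1 1 2 2 3 3
```

With a real data frame the call would look like `all.tf7$bin <- assign.bins(all.tf7$chrom, all.tf7$chromStart)`, assuming the rows are already sorted by chrom and chromStart.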


Re: apply with multiple conditions

Adams, Jean
Thanks for the intrusion!  I have
        options(stringsAsFactors=FALSE)
and Paul probably doesn't, so he saw factors where I saw characters.
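
A quick way to check which case applies is to look at the class of the column. The sketch below builds the same column both ways; note that, to my knowledge, `data.frame()` defaults to `stringsAsFactors = FALSE` from R 4.0.0 onward, so the factor case has to be requested explicitly there. Converting with `as.character()` first is one way (a suggestion added here, not from the thread) to make the comparison behave the same in either case; the helper name `chrom.changed` is hypothetical.

```r
chrom <- c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2")

df.fac <- data.frame(chrom = chrom, stringsAsFactors = TRUE)
df.chr <- data.frame(chrom = chrom, stringsAsFactors = FALSE)

class(df.fac$chrom)  # "factor"
class(df.chr$chrom)  # "character"

# Hypothetical helper: flags rows where chrom differs from the previous row,
# comparing labels so that factors and characters give the same answer
chrom.changed <- function(x) {
  x <- as.character(x)
  prev <- c(NA, x[-length(x)])
  is.na(prev) | x != prev
}

bin.fac <- cumsum(chrom.changed(df.fac$chrom))
bin.chr <- cumsum(chrom.changed(df.chr$chrom))
print(bin.fac)  # 1 1 2 2 2 2
print(bin.chr)  # 1 1 2 2 2 2
```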

Paul,

I saw your other note ... try this code

L <- nrow(df)
# assign a new bin every time chrom changes
# (the levels() lookup assumes df$chrom is a factor)
prev.chrom <- c(NA, df$chrom[-L])
bin1 <- cumsum(is.na(prev.chrom) |
        df$chrom != levels(df$chrom)[prev.chrom])

# find the minimum chromStart within each bin, repeated for every row
min.start <- tapply(df$chromStart, bin1, min, na.rm=TRUE)[bin1]

# split bins further if chromStart >= 115341 + min.start
bin2 <- floor((df$chromStart - min.start) / 115341)

# combine the two bins into one
df$bin <- interaction(bin1, bin2)

df
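
Run on the example data from earlier in the thread, the steps above can be traced as follows. This is a sketch with a few assumptions: `as.character()` replaces the `levels()` lookup so that it also runs when chrom is a character column, `ave()` stands in for the `tapply(...)[bin1]` lookup, and `drop = TRUE` plus the `split()` call are added to illustrate the stated aim of splitting the data frame by bin.

```r
df <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2", "chr2"),
  chromStart = c(10089, 10132, 10133, 10148, 210382, 216132),
  chromEnd = c(10309, 10536, 10362, 10418, 210578, 216352)
)
width <- 115341  # bin width used throughout the thread

# bin1: one bin per run of identical chrom values
chrom <- as.character(df$chrom)
prev.chrom <- c(NA, chrom[-nrow(df)])
bin1 <- cumsum(is.na(prev.chrom) | chrom != prev.chrom)

# smallest chromStart within each bin1, repeated for every row of that bin
min.start <- ave(df$chromStart, bin1, FUN = min)

# bin2: number of whole widths between a row and the start of its bin1
bin2 <- floor((df$chromStart - min.start) / width)

# combine the two; drop = TRUE discards combinations that never occur
df$bin <- interaction(bin1, bin2, drop = TRUE)
print(as.character(df$bin))  # "1.0" "1.0" "2.0" "2.0" "2.1" "2.1"

# one data frame per bin
pieces <- split(df, df$bin)
print(sapply(pieces, nrow))
```

Note that this lays a fixed grid of width 115341 starting at the first chromStart of each chromosome. That is not quite the same as the original loop, which restarts the count at the first row of every new bin, so the two can disagree when there are large gaps between rows.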


Jean





Re: apply with multiple conditions

pguilha
Thank you both very much! The factor issue was indeed solved by your
modifications, and Jean, that last bit of code does exactly what I
need. Perfect!
Thanks again,
Paul
