subset only if f.e a column is successive for more than 3 values

Knut Krueger-8
Hi all,

I need a subset of a data frame that keeps only the rows where a column
contains a run of, e.g., 3 successive values:
Example from the subset help page:

subset(airquality, Temp > 80, select = c(Ozone, Temp))
29     45   81
35     NA   84
36     NA   85
38     29   82
39     NA   87
40     71   90
41     39   87
42     NA   93
43     NA   92
44     23   82
.....

I would like to get only

...
40     71   90
41     39   87
42     NA   93
43     NA   92
44     23   82
....

because the left column (the row index) increases by 1 more than, e.g.,
three times in a row without a gap.

Any hints for a package, or do I need to write my own function?

Kind Regards Knut

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: subset only if f.e a column is successive for more than 3 values

Bert Gunter-2
1. I assume the values are integers, not floats/numerics (which would make
it more complicated).

2. Strategy: Take differences (e.g. see ?diff) and look for >3 1's in a
row.

I don't have time to work out details, but perhaps that helps.
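
A minimal sketch of that strategy, using the row labels from the excerpt in
the original post. The variable names (idx, d, runs, has_long_run) are
illustrative assumptions, not part of the advice above.

```r
# Row labels from the excerpt in the original post (assumed to be integers).
idx <- c(29, 35, 36, 38, 39, 40, 41, 42, 43, 44)

# Successive differences: a 1 means "no gap" between neighbouring rows.
d <- diff(idx)

# Run-length encoding of the "no gap" indicator.
# For non-integer data one could compare with a tolerance instead,
# e.g. abs(d - 1) < 1e-8 (an assumption, not part of the original advice).
runs <- rle(d == 1)

# Is there any run of more than three 1's in a row?
has_long_run <- any(runs$values & runs$lengths > 3)
```

Here the last six differences are all 1, so has_long_run is TRUE.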

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Thu, Sep 27, 2018 at 7:49 AM Knut Krueger <[hidden email]>
wrote:

> [...]

Re: subset only if f.e a column is successive for more than 3 values

Jim Lemon-4
Hi Knut,
As Bert said, you can start with diff and work from there. I can
easily get the text for the subset, but despite fooling around with
"parse", "eval" and "expression", I couldn't get it to work:

# use a bigger subset to test whether multiple runs can be extracted
kkdf<-subset(airquality,Temp > 77,select=c("Ozone","Temp"))
kkdf$index<-as.numeric(rownames(kkdf))
# get the run length encoding
seqindx<-rle(diff(kkdf$index)==1)
# flag the runs of consecutive indices that have at least 3 unit steps
runsel<-seqindx$lengths >= 3 & seqindx$values
# get the indices for the starts of the runs
starts<-cumsum(seqindx$lengths)[runsel[-1]]+1
# and the ends
ends<-cumsum(seqindx$lengths)[runsel]+1
# the character representation of the subset as indices is
paste0("c(",paste(starts,ends,sep=":",collapse=","),")")

I expect there will be a lightning response from someone who knows
about converting the resulting string into whatever is needed.

Jim
On Fri, Sep 28, 2018 at 1:13 AM Bert Gunter <[hidden email]> wrote:

> [...]

Re: subset only if f.e a column is successive for more than 3 values

Jim Lemon-4
Bugger! It's

eval(parse(text=paste0("kkdf[c(",paste(starts,ends,sep=":",collapse=","),"),]")))

What a mess!
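
A possible way to avoid the eval(parse()) round trip is to build the row
numbers directly and index with them. This is only a sketch: the data frame
df and the starts/ends values below are hypothetical stand-ins, not the
values computed above.

```r
# Hypothetical data frame and run boundaries, for illustration only.
df <- data.frame(id = 1:15, val = letters[1:15])
starts <- c(2L, 10L)
ends   <- c(5L, 13L)

# Expand each start:end pair into row numbers and combine them.
rows <- unlist(Map(seq, starts, ends))

# Ordinary integer indexing; no string building or parsing needed.
res <- df[rows, ]
```

Map() pairs each start with its end, so the two vectors must have the same
length; a mismatch would be recycled rather than reported.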

Jim
On Fri, Sep 28, 2018 at 8:35 AM Jim Lemon <[hidden email]> wrote:

> [...]

Re: subset only if f.e a column is successive for more than 3 values

Knut Krueger-8
Hi Jim,
thanks, it works with the given example,
but what is different when using

testdata=data.frame(
  TIME=c("17:11:20", "17:11:21", "17:11:22", "17:11:23", "17:11:24", "17:11:25",
         "17:11:26", "17:11:27", "17:11:28", "17:21:43", "17:22:16", "17:22:19",
         "18:04:48", "18:04:49", "18:04:50", "18:04:51", "18:04:52", "19:50:09",
         "00:59:27", "00:59:28", "00:59:29", "04:13:40", "04:13:43", "04:13:44"),
  index=c(8960, 8961, 8962, 8963, 8964, 8965, 8966, 8967, 8968, 9583, 9616, 9619,
          12168, 12169, 12170, 12171, 12172, 18489, 37047, 37048, 37049,
          48700, 48701, 48702))

seqindx<-rle(diff(testdata$index)==1)
runsel<-seqindx$lengths >= 3 & seqindx$values
# get the indices for the starts of the runs
starts<-cumsum(seqindx$lengths)[runsel[-1]]+1
# and the ends
ends<-cumsum(seqindx$lengths)[runsel]+1

eval(parse(text=paste0("testdata[c(",paste(starts,ends,sep=":",collapse=","),"),]")))

The result (index column) is
12168,9619,9616,9583,8968,12168,12169,12170,12171,12172

Maybe it is caused by the gaps between ... 8967,8968,9583,9616,9619,12168,12169 ...?

Regards Knut
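
The unexpected rows appear to come from how starts is computed. runsel[-1]
shifts the selection by one so that it picks the end of the run *before*
each selected run. When the very first run is selected, as it is here, there
is no preceding run, so starts ends up shorter than ends and paste() recycles
it, giving "13:9" and "13:17". A sketch of one possible fix, deriving each
start from its own end and run length (only the index column is reproduced;
rows and result are illustrative names):

```r
# Index column from the test data above.
testdata <- data.frame(index = c(
  8960, 8961, 8962, 8963, 8964, 8965, 8966, 8967, 8968, 9583, 9616, 9619,
  12168, 12169, 12170, 12171, 12172, 18489, 37047, 37048, 37049,
  48700, 48701, 48702))

seqindx <- rle(diff(testdata$index) == 1)
runsel  <- seqindx$lengths >= 3 & seqindx$values

# Position of the last unit step of each selected run; +1 gives its last row.
ends <- cumsum(seqindx$lengths)[runsel] + 1
# Step back by the run length to reach the first row of the same run.
starts <- ends - seqindx$lengths[runsel]

# Expand the start/end pairs into row numbers and subset directly.
rows <- unlist(Map(seq, starts, ends))
result <- testdata[rows, , drop = FALSE]
```

With this data the selected rows are 1 to 9 and 13 to 17, i.e. index values
8960 to 8968 and 12168 to 12172.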

Re: subset only if f.e a column is successive for more than 3 values

R help mailing list-2
In reply to this post by Knut Krueger-8
Do you also want lines 38 and 39 (in addition to 40:44), or do I
misunderstand your problem?

When you deal with runs of data, think of the rle (run-length encoding)
function.  E.g., here is a barely tested function to find runs of a given
minimum length and a given difference between successive values.  It also
returns a 'runNumber' so you can split the result into runs.

findRuns <- function(x, minRunLength=3, difference=1) {
     # for integral x, find runs of length at least 'minRunLength'
     # with 'difference' between successive values
     d <- diff(x)
     dRle <- rle(d)
     w <- rep(dRle$lengths>=minRunLength-1 & dRle$values==difference,
              dRle$lengths)
     values <- x[c(FALSE,w) | c(w,FALSE)]
     runNumber <- cumsum(c(TRUE, diff(values)!=difference))
     data.frame(values=values, runNumber=runNumber)
}

> findRuns(c(10,8,6,4,1,2,3,20,17,18,19,20))
  values runNumber
1      1         1
2      2         1
3      3         1
4     17         2
5     18         2
6     19         2
7     20         2
> findRuns(c(10,8,6,4,1,2,3,20,17,18,19,20), minRunLength=4)
  values runNumber
1     17         1
2     18         1
3     19         1
4     20         1
> findRuns(c(10,8,6,4,1,2,3,20,17,18,19,20), difference=-2)
  values runNumber
1     10         1
2      8         1
3      6         1
4      4         1
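
To get back to the original question of subsetting data frame rows, the same
logical mask can be applied to the rows directly. A minimal sketch, assuming
the row labels are stored in a column (here called row, a hypothetical name)
and using only the ten rows shown in the original post:

```r
# The ten rows quoted in the original post; 'row' holds the row labels.
hot <- data.frame(
  row   = c(29, 35, 36, 38, 39, 40, 41, 42, 43, 44),
  Ozone = c(45, NA, NA, 29, NA, 71, 39, NA, NA, 23),
  Temp  = c(81, 84, 85, 82, 87, 90, 87, 93, 92, 82)
)
minRunLength <- 3

# Same mask as in findRuns(), with the difference fixed at 1.
r <- rle(diff(hot$row))
w <- rep(r$lengths >= minRunLength - 1 & r$values == 1, r$lengths)
keep <- c(FALSE, w) | c(w, FALSE)

# Keep whole rows rather than just the values.
kept <- hot[keep, ]
```

With these ten rows the mask keeps rows 38 to 44, which includes the rows 38
and 39 asked about above.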


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Sep 27, 2018 at 7:48 AM, Knut Krueger <[hidden email]>
wrote:

> [...]