a function more appropriate than 'sapply'?


a function more appropriate than 'sapply'?

emorway
I'm wondering if I need to use a function other than sapply, as the following line of code runs indefinitely (over 30 minutes so far) and uses up all 16 GB of memory on my machine for what seems like a very small dataset (data attached in the text file wells.txt).  The R code is:

wells<-read.table("c:/temp/wells.txt",col.names=c("name","plc_hldr"))
wells2<-wells[sapply(wells[,1],function(x)length(strsplit(as.character(x), "_")[[1]])==2),]

The 2nd line of R code above gets bogged down and takes all my RAM with it:
[screenshot memory_loss.png: RAM usage climbing to the full 16 GB]

I'm simply trying to extract all of the lines of data that have a single "_" in the first column and place them into a dataset called "wells2".  If that were to work, I then want to extract the lines of data that have two "_" and put them into a separate dataset, say "wells3".  Is there a better way to do this than the one-liner above?

-Eric
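
For reference, a minimal vectorized sketch of the same filtering (an assumption: the same wells.txt and the wells2/wells3 names above) counts the underscores per name without an element-wise apply call:

    ## sketch: count "_" per name with vectorized string functions
    ## (assumes the names are read as character, not factor)
    wells <- read.table("c:/temp/wells.txt", col.names = c("name", "plc_hldr"),
                        stringsAsFactors = FALSE)
    n_us <- nchar(wells$name) - nchar(gsub("_", "", wells$name, fixed = TRUE))
    wells2 <- wells[n_us == 1, ]   # names with a single "_"
    wells3 <- wells[n_us == 2, ]   # names with two "_"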

Re: a function more appropriate than 'sapply'?

arun kirshna
Hi,
Maybe this helps:
 wells<-read.table("wells.txt",header=FALSE,stringsAsFactors=F)


wells2 <- wells[-grep(".*\\_.*\\_", wells[,1]), ]
head(wells2)
  #   V1 V2
#1  w7_1  0
#2 w11_1  0
#3 w12_1  0
#4 w13_1  0
#5 w14_1  0
#6 w15_1  0



wellsNew <- wells[grep(".*\\_.*\\_", wells[,1]), ]
head(wellsNew)
#            V1 V2
#851 99_10_4395  0
#852 99_10_4396  0
#853 99_10_4400  0
#854 99_10_4403  0
#855 99_10_4404  0
#856 99_10_4606  0
nrow(wells)
# [1] 46366
nrow(wells2)
# [1] 38080
nrow(wellsNew)
# [1] 8286
38080 + 8286
# [1] 46366
A.K.




Re: a function more appropriate than 'sapply'?

Berend Hasselman
In reply to this post by emorway

Read your file with

        wells<-read.table("wells.txt",col.names=c("name","plc_hldr"), stringsAsFactors=FALSE)

Remove all non-underscore characters with

        w.sub <- gsub("[^_]+","",wells[,1])

then find the indices of the elements of w.sub with two underscores and with a single underscore with

        u.2 <- which(w.sub=="__")
        u.1 <- which(w.sub=="_")

and use u.1 and u.2 to select the appropriate rows of wells.
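
For instance, a minimal sketch of that last step (using the wells2/wells3 names from the original post):

        wells2 <- wells[u.1, ]   # names with a single "_"
        wells3 <- wells[u.2, ]   # names with two "_"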

I tried to select rows containing 1 or 2 underscores with grep regular expressions but that appeared to be more difficult than I had expected.
The method above is quick.

Berend


Re: a function more appropriate than 'sapply'?

Uwe Ligges-3


On 26.01.2013 20:46, Berend Hasselman wrote:

> and use u.1 and u.2 to select the appropriate rows of wells.

With grep:

wells1 <- wells[grep("^[^\\_]*_[^\\_]*$", wells[,1]),]
wells2 <- wells[grep("^[^\\_]*_[^\\_]*_[^\\_]*$", wells[,1]),]
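
The anchored patterns require exactly one underscore (first line) and exactly two (second); a quick sanity check, sketched with two names taken from the head() output earlier in the thread:

x <- c("w7_1", "99_10_4395")
grepl("^[^\\_]*_[^\\_]*$", x)           # [1]  TRUE FALSE  -- exactly one "_"
grepl("^[^\\_]*_[^\\_]*_[^\\_]*$", x)   # [1] FALSE  TRUE  -- exactly two "_"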


Best,
Uwe Ligges



Re: a function more appropriate than 'sapply'?

arun kirshna
In reply to this post by Berend Hasselman
Hi,

grep() was found to be a bit faster:


system.time({
  indx <- grep(".*\\_.*\\_", wells[,1])
  wells2 <- wells[-indx, ]
  wellsNew <- wells[indx, ]
})
#   user  system elapsed
#  0.024   0.000   0.023

system.time({
  w.sub <- gsub("[^_]+", "", wells[,1])
  u.2 <- which(w.sub == "__")
  u.1 <- which(w.sub == "_")
  w.u1 <- wells[u.1, ]
  w.u2 <- wells[u.2, ]
})
#   user  system elapsed
#  0.048   0.000   0.047

identical(wells2, w.u1)
# [1] TRUE
identical(wellsNew, w.u2)
# [1] TRUE


A.K.


Re: a function more appropriate than 'sapply'?

Berend Hasselman
In reply to this post by Uwe Ligges-3

On 26-01-2013, at 21:09, Uwe Ligges <[hidden email]> wrote:

> With grep:
>
> wells1 <- wells[grep("^[^\\_]*_[^\\_]*$", wells[,1]),]
> wells2 <- wells[grep("^[^\\_]*_[^\\_]*_[^\\_]*$", wells[,1]),]
>

Are the \\ necessary?
I tried without the \\ and that gives identical results.

Berend


Re: a function more appropriate than 'sapply'?

Uwe Ligges-3


On 26.01.2013 21:23, Berend Hasselman wrote:

> Are the \\ necessary?
> I tried without the \\ and that gives identical results.

Ah, I was not sure, and then I forgot to look into the docs. Let's leave
it as an exercise for the reader.

Best,
Uwe
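
A quick check (a sketch using two names from the thread) suggests they are not needed: "_" is not a regex metacharacter, so for these names the escaped and unescaped patterns agree:

x <- c("w7_1", "99_10_4395")
identical(grepl("^[^\\_]*_[^\\_]*$", x), grepl("^[^_]*_[^_]*$", x))
# [1] TRUE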


