extracting values from txt file that follow user-supplied quote

extracting values from txt file that follow user-supplied quote

emorway
useRs-

I'm attempting to scan a text file of more than 1 GB and to read and store the values that follow a specific key phrase that is repeated multiple times throughout the file.  A snippet of the text file I'm trying to read is attached.  The text file is a dumping ground for various aspects of the performance of the model that generates it.  Thus, the information I want to extract is not in a fixed position (i.e., it does not always appear at a predictable location, like line 1000 or 2000).  Rather, the desired values always follow a specific phrase: "   PERCENT DISCREPANCY ="

One approach I took was the following:

library(R.utils)

txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
#The path above will need to be altered to test the code on the attached txt file, which will run much quicker
system.time(num_lines<-countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
#elapsed time on the full 1 GB file was about 55 seconds on a 3.6 GHz Xeon
num_lines
#14405247

pd<-numeric()  #initialize the result vector before appending to it
system.time(
for(i in 1:num_lines){
  txt_line<-readLines(txt_con,n=1)
  if (length(grep("    PERCENT DISCREPANCY =",txt_line))) {
    pd<-c(pd,as.numeric(substr(txt_line,70,78)))
  }
}
)
close(txt_con)
#The loop took about 5 minutes

The inefficiencies in this approach arise due to reading the file twice (first to get num_lines, then to step through each line looking for the desired text).  

Is there a way to speed this process up through the use of scan() (see ?scan)?  I wasn't able to get anything working, but what I had in mind was to scan through the file and, when the key phrase (e.g.  "     PERCENT DISCREPANCY =  ") is encountered, read and store the next 13 characters (which will include some white space) as a numeric value, then resume the scan until the key phrase is encountered again, repeating until the end of the file is reached.  Is such an approach even possible, or is line-by-line the best bet?

MCR.out

Re: extracting values from txt file that follow user-supplied quote

Rainer Schuermann
R may not be the best tool for this.
Did you look at gawk? It is also available for Windows:
http://gnuwin32.sourceforge.net/packages/gawk.htm

Once gawk has written a new file that only contains the lines / data you want, you could use R for the next steps.
You can also run gawk from within R with the system() command.
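
As a rough sketch of that last point, the filtering step could be launched from R with system2(). Everything here is illustrative: the helper name filter_with_awk is hypothetical, the program name "awk" is an assumption (it may be "gawk" on Windows), the quoting assumes a Unix-like shell, and columns 70 to 78 are taken from the original post.

```r
# Hypothetical helper: let awk keep only the lines containing the key phrase;
# the numeric parsing is then done in R.
filter_with_awk <- function(path, awk = "awk") {
  prog <- "/PERCENT DISCREPANCY =/"   # awk pattern; the default action prints the line
  system2(awk, c(shQuote(prog), shQuote(path)), stdout = TRUE)
}

# Small demonstration file so the sketch is self-contained
demo <- tempfile(fileext = ".out")
writeLines(c("some unrelated model output",
             sprintf("%-69s%9.2f", "    PERCENT DISCREPANCY =", 0.01),
             "more unrelated output",
             sprintf("%-69s%9.2f", "    PERCENT DISCREPANCY =", -0.05)),
           demo)

# Only run the external step if an awk program is actually available
if (nzchar(Sys.which("awk"))) {
  kept <- filter_with_awk(demo)
  pd_awk <- as.numeric(substr(kept, 70, 78))
  print(pd_awk)
}
```

Whether this beats a pure R solution depends on the system; it mainly moves the line filtering out of R.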

Rgds,
Rainer


______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: extracting values from txt file that follow user-supplied quote

Bert Gunter
I think 1 gb is small enough that this can be easily and efficiently
done in R. The key is: regular expressions are your friend.

I shall assume that the text file has been read into R as a single
character string, named "mystring". The code below could easily be
modified to work on a vector of strings if the file is read in line
by line. Alternatively, paste(yourvec, collapse="") could be used to
collapse such a vector into a single string, which would be necessary
if, for instance, the key info could be broken up over several
lines/elements of the vector.

Here is a small reproducible example of a way to do this, in which the
keyword string to search for is "qdxpRxt" and the next 3 characters
that follow it are what you wish to extract.

### Create the example
test <- lapply(sample(1:20,30,rep=TRUE) ,function(n)
                        paste(sample(c(letters,LETTERS,rep(" ",12)),n,rep=TRUE),collapse=""))
mystring <- paste(lapply(test,function(x)paste(x,c("qdxpRxt"),
  floor(runif(1,0,1000)),sep="")),collapse="")

## Extract the strings of interest
extracted <-  strsplit(gsub(".*?qdxpRxt(.{3})","\\1%&",mystring),"%&")

Note that this is not quite right yet. The extracted strings might
include characters, not just numbers; the strings have to be converted
from character to numeric; and I assumed that "%&" would not occur in
the extracted strings and could be used as a split string. I leave the
fixups to you.
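
One possible way to do those fixups is sketched here, under stated assumptions: the helper name extract_after is hypothetical, perl = TRUE is my choice rather than anything prescribed in this thread, and the number pattern is only one reasonable definition of "a number". It matches the key phrase followed by a number, captures the number, and converts it to numeric, so no split marker such as "%&" is needed.

```r
# Hypothetical helper: return the number that follows each occurrence of `key`.
extract_after <- function(x, key = "PERCENT DISCREPANCY =") {
  # optional sign, digits with optional decimal part (or a bare ".5"), optional exponent
  num <- "[+-]?(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][+-]?\\d+)?"
  # \\Q...\\E makes PCRE treat the key phrase literally; ( ) captures the number
  pat <- paste0("\\Q", key, "\\E\\s*(", num, ")")
  hits <- unlist(regmatches(x, gregexpr(pat, x, perl = TRUE)))
  as.numeric(sub(pat, "\\1", hits, perl = TRUE))
}

txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY =          -0.05"
vals <- extract_after(txt_line)
vals
```

Because regmatches() works element by element, the same call also accepts a vector of lines as returned by readLines().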

Note that I make no claims about the efficiency of this approach vis a
vis external tools like gawk -- of which there are many -- nor that
any of several R string function packages might also easily do the
job. I only wanted to point out that it appears to be straightforward
in vanilla R since this already has the regular expression engine in
it.

Cheers,
Bert




--

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

extracting values from txt with regular expression

emorway
Thanks for your suggestions.  Bert, your response made me aware of "regular expressions".  Are regular expressions the same across various languages?  Consider the following line of text:

txt_line<-" PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY =          -0.05"

It seems Python uses the following line of code to extract the two values in "txt_line" and store them in a variable called "v":

v = re.findall("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?", line)
#v[0]  0.01
#v[1]  -0.05

I tried something similar in R (but it didn't work) by using the same regular expression, but got an error:

edm<-grep("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?",txt_line)
#Error: '\d' is an unrecognized escape in character string starting "[+-]? *(?:\d"

I'm not even sure which function in R most efficiently extracts the values from "txt_line".  Basically, I want to peel out the values and think I can use the decimal point to construct the regular expression, but I don't know where to go from here.
Re: extracting values from txt with regular expression

Nordlund, Dan (DSHS/RDA)
I am a regular expression novice, but the error message you are receiving is the result of not doubling the backslashes in your regular expression pattern.  The backslash needs to be escaped.  So this will get you close to what you want (although not necessarily efficiently).

ndx <- gregexpr("[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?",txt_line)
matched <- regmatches(txt_line, ndx)
matched
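
To make the escaping rule concrete, here is a small sketch. The R parser consumes one level of backslashes in a string literal, so "\\d" in source code is the two-character regular expression \d. The use of perl = TRUE and the conversion with as.numeric() are my additions, not part of the suggestion above.

```r
# "\\d" in R source is two characters: a backslash and the letter d
n_chars <- nchar("\\d")
n_chars

# R >= 4.0.0 also accepts raw strings, which need no doubling
same_pattern <- identical(r"(\d)", "\\d")
same_pattern

pattern <- "[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?"
txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY =          -0.05"

# regmatches() returns a list with one character vector per input string
matched <- regmatches(txt_line, gregexpr(pattern, txt_line, perl = TRUE))
values <- as.numeric(unlist(matched))
values
```

Note that this pattern requires a decimal point or a leading ".", so bare integers such as "5" would not be matched.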


Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204


Re: extracting values from txt file that follow user-supplied quote

Rui Barradas
In reply to this post by emorway
Hello,

I've just read your follow-up question on regular expressions, and I
believe this, your original problem, can be made much faster. Just use
readLines() differently, reading many lines of text at a time.
For this to work you will still need to know the total number of lines
in the file.



fun <- function(con, pattern, nlines, n=5000L){
        if(is.character(con)){
                con <- file(con, open="rt")
                on.exit(close(con))
        }
        passes <- nlines %/% n
        remaining <- nlines %% n
        res <- NULL
        for(i in seq_len(passes)){
                txt <- readLines(con, n=n)
                res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
        }
        if(remaining){
                txt <- readLines(con, n=remaining)
                res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
        }
        res
}


url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
pat <- "PERCENT DISCREPANCY ="
num_lines <- 14405247L

# your original
txt_con<-file(description=url,open="r")
pd <- NULL
t1 <- system.time(
for(i in 1:num_lines){
   txt_line<-readLines(txt_con,n=1)
   if (length(grep(pat,txt_line))) {
     pd<-c(pd,as.numeric(substr(txt_line,70,78)))
   }
}
)
close(txt_con)

# the function above, increased 'n'
t2 <- system.time(pd2 <- fun(url, pat, num_lines, 100000L))

all.equal(pd, pd2)
[1] TRUE
rbind(original=t1, fun=t2, ratio=t1/t2)
          user.self sys.self  elapsed user.child sys.child
original    780.16   196.16 981.9100         NA        NA
fun           0.10     0.04   3.2000         NA        NA
ratio      7801.60  4904.00 306.8469         NA        NA


A factor of 300.
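
If counting the lines first is undesirable, a variant of the chunked approach can simply stop when readLines() returns nothing. The following is a minimal sketch under that assumption; the name read_matching, the demonstration file, and the tiny chunk size are all illustrative and not taken from the code above.

```r
# Hypothetical variant: read in chunks until the connection is exhausted,
# so the total number of lines never has to be known.
read_matching <- function(path, pattern, from, to, chunk_size = 100000L) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  pieces <- list()
  repeat {
    txt <- readLines(con, n = chunk_size)
    if (length(txt) == 0L) break              # nothing left to read
    hits <- txt[grepl(pattern, txt, fixed = TRUE)]
    # collect per-chunk results in a list instead of growing a vector with c()
    pieces[[length(pieces) + 1L]] <- as.numeric(substr(hits, from, to))
  }
  as.numeric(unlist(pieces))
}

# Small demonstration file with the value in columns 70 to 78
demo <- tempfile(fileext = ".out")
vals <- c(0.01, -0.05, 0.00, 1.25, -3.50)
demo_lines <- as.vector(rbind("unrelated model output",
                              sprintf("%-69s%9.2f", "    PERCENT DISCREPANCY =", vals)))
writeLines(demo_lines, demo)

pd_chunked <- read_matching(demo, "PERCENT DISCREPANCY =", 70, 78, chunk_size = 3L)
pd_chunked
```

The chunk size of 3 is only there to force several passes over the ten-line demonstration file; for a file of the size discussed here something like the 100000 used above seems more sensible.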

Hope this helps,

Rui Barradas

Re: extracting values from txt with regular expression

emorway
In reply to this post by Nordlund, Dan (DSHS/RDA)
Hi Dan and Rui,  Thank you for the suggestions, both were very helpful.  Rui's code was quite fast...there is one more thing I want to explore for my own edification, but first I need some help fixing the code below, which is a slight modification to Dan's suggestion.  It'll no doubt be tough to beat the time Rui's code finished the task in, but I'm willing to try.  First, I need to fix the following, which 'peels' the wrong bit of text from "txt_line".  Instead of extracting as it now does (shown below), can the code be modified to extract the values 0.01 and -0.05, and store them in the variable 'extracted'?

txt_line<-" PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY =          -0.05"
extracted <-  strsplit(gsub("[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?","\\1%&",txt_line),"%&")
extracted
#[1] " PERCENT DISCREPANCY =           "    "     PERCENT DISCREPANCY =          "

Re: extracting values from txt with regular expression

Rui Barradas
Hello,

Just put the entire regexp between parentheses.

extracted <-
strsplit(gsub("([+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?)","\\1%&",txt_line),"%&")

extracted

sapply(strsplit(unlist(extracted), "="), "[", 2)
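
The reason the extra parentheses matter is that "\\1" in the replacement refers to the first capturing group, and groups written as (?: ... ) are non-capturing, so the original pattern had no group 1 at all. A short sketch of the difference follows; the angle-bracket markers, the use of perl = TRUE, and the final conversion to numeric are my own illustrative choices.

```r
txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY =          -0.05"
num <- "[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?"

# Wrapping the whole pattern in ( ) turns it into capturing group 1,
# so "\\1" now puts each matched number back into the string.
marked <- gsub(paste0("(", num, ")"), "<\\1>", txt_line, perl = TRUE)
marked

# Same idea with the "%&" marker, followed by a split and conversion to numeric
pieces <- strsplit(gsub(paste0("(", num, ")"), "\\1%&", txt_line, perl = TRUE),
                   "%&", fixed = TRUE)[[1]]
values <- as.numeric(sub(".*=", "", pieces))   # drop everything up to the "="
values
```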


As for speed, I believe that this might take longer. It will have to
match a regular expression, then substitute, then split. A routine like
the one I sent usually gives a speedup of an order of magnitude or more. I
first wrote one around 20 years ago; I can now write it with
my eyes closed and it consistently beats the alternatives, but there's no
harm in trying. Or in combining strategies.

Good luck.

Rui Barradas

Re: extracting values from txt with regular expression

Gabor Grothendieck
In reply to this post by emorway
Try this.  strapply applies the function (3rd argument) to each match
of the regular expression (2nd argument), outputting the result of the
function.  The regular expression we have used matches a minus or
digit followed by non-spaces.  That seems good enough for this simple
example but, of course, it can be changed.

> library(gsubfn)
> p <- "[-0-9]\\S+"
> txt_line <- " PERCENT DISCREPANCY =           0.01     PERCENT DISCREPANCY = -0.05"
>
> strapply(txt_line, p, as.numeric)[[1]]
[1]  0.01 -0.05

or using strapplyc (which is similar but uses c as the function) and
is optimized for speed:

> as.numeric(strapplyc(txt_line, p)[[1]])
[1]  0.01 -0.05

If we are only parsing a few lines then the speed does not matter but
if there are large amounts to parse then be sure to have the tcltk
package installed to get the best speed from the gsubfn functions (on
Windows and most but not all Linux systems tcltk is installed by
default but on a few you have to do it yourself).  If you don't have
tcltk the gsubfn package will use R which is slower.  Also, as noted,
strapplyc is faster than strapply.  There are arguments and options
that can override the defaults.

The gsubfn home page is at http://gsubfn.googlecode.com

Regular expressions are largely the same but not 100% identical across
languages.  There are some links to regular expression info in
different languages at the bottom of the home page just listed.   R
can use R or Perl regular expressions and the gsubfn functions, in
addition, can use Tcl regular expressions.
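
As a small illustration of how the matching mode changes results in base R, here is a sketch using grepl(); the example strings are made up. Base R functions use extended regular expressions by default, Perl-compatible ones with perl = TRUE, and plain literal matching with fixed = TRUE.

```r
x <- c("0.01", "0x01")

# Default regex: "." is a wildcard, so both strings match
default_hits <- grepl("0.01", x)
default_hits

# fixed = TRUE: the pattern is taken literally, so only "0.01" matches
fixed_hits <- grepl("0.01", x, fixed = TRUE)
fixed_hits

# perl = TRUE enables Perl-style constructs such as lookahead
lookahead_hit <- grepl("DISCREPANCY(?= =)", " PERCENT DISCREPANCY =   0.01", perl = TRUE)
lookahead_hit
```

For a literal key phrase such as the one in this thread, fixed = TRUE avoids any surprises from metacharacters and is usually the cheapest option.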

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Re: extracting values from txt file that follow user-supplied quote

Gabor Grothendieck
In reply to this post by emorway
Try this:

g <- function(url, string, from, to, ...) {
        L <- readLines(url)
        matched <- grep(string, L, value = TRUE, ...)
        as.numeric(substring(matched, from, to))
}

> url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
> g(url, "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)
[1]    NA  0.00 -0.01    NA  0.00 -0.01


--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Re: extracting values from txt file that follow user-supplied quote

emorway
I'll summarize the results in terms of total run time for the suggestions that have been made as well as post the code for those that come across this post in the future.  First the results (the code for which is provided second):

What I tried to do using suggestions from Bert and Dan:
t1
#   user  system elapsed
# 208.21    1.68  210.34

Gabor's suggested code:
t2
#   user  system elapsed
#  51.12    0.63   51.75

Rui's suggested code:
t3a  #(Get the number of lines)
#   user  system elapsed
#  45.13   11.08   56.23
t3b  #(now perform the function)
#   user  system elapsed
#  50.59    0.55   51.16

So in summary it appears that Gabor's and Rui's code are quite similar in terms of runtime if the number of lines in the file is known a priori (i.e. t2 is roughly equal to t3b).  Gabor's code seems a little more robust since it doesn't require that the total number of lines in the file be supplied.  And here is the code used to get these times (note that the file I used was the 1 GB text file, not the reduced version attached to the top post of this thread):

#----------------
#modified attempt
#----------------

library(gsubfn)
library(tcltk2)
p<-"[-0-9]\\S+"
pd<-numeric()
txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
t1<-system.time(
while (length(txt_line<-readLines(txt_con,n=1))){
  if (length(grep("DISCREPANCY = ",txt_line))) {
    pd<-c(pd,as.numeric(strapplyc(txt_line, p)[[1]]))
  }
})
close(txt_con)
t1
#   user  system elapsed
# 208.21    1.68  210.34


#----------------------------
#Suggested by G. Grothendieck
#----------------------------

g<-function(txt_con, string, from, to, ...) {
        L <- readLines(txt_con)
        matched <- grep(string, L, value = TRUE, ...)
        as.numeric(substring(matched, from, to))
}
txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
t2<-system.time(
  edm<-g(txt_con, "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)
)
close(txt_con)
t2
#   user  system elapsed
#  51.12    0.63   51.75

#-------------------------
#Suggested by Rui Barradas
#-------------------------

library(R.utils)
t3a<-system.time(num_lines<-countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
t3a
#   user  system elapsed
#  45.13   11.08   56.23


fun <- function(con, pattern, nlines, n=5000L){
        if(is.character(con)){
                con <- file(con, open="rt")
                on.exit(close(con))
        }
        passes <- nlines %/% n
        remaining <- nlines %% n
        res <- NULL
        for(i in seq_len(passes)){
                txt <- readLines(con, n=n)
                res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
        }
        if(remaining){
                txt <- readLines(con, n=remaining)
                res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
        }
        res
}

txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
pat<-"PERCENT DISCREPANCY ="
num_lines <- 14405247L
t3b <- system.time(pd2 <- fun(txt_con, pat, num_lines, 100000L))
close(txt_con)
t3b
#   user  system elapsed
#  50.59    0.55   51.16
