|
useRs-
I'm attempting to scan a text file of more than 1 GB and to read and store the values that follow a specific key phrase that is repeated multiple times throughout the file. A snippet of the text file I'm trying to read is attached. The text file is a dumping ground for various aspects of the performance of the model that generates it. Thus, the information I want to extract from the file is not in a fixed position (i.e. it does not always appear in a predictable location, like line 1000, or 2000, etc.). Rather, the desired values always follow a specific phrase: " PERCENT DISCREPANCY ="

One approach I took was the following:

    library(R.utils)

    txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
    #The above will need to be altered if one desires to test code on the
    #attached txt file, which will run much quicker
    system.time(num_lines<-countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
    #elapsed time on full 1 GB file took about 55 seconds on a 3.6 GHz Xeon
    num_lines
    #14405247

    pd<-numeric()  #must exist before the loop appends to it
    system.time(
      for(i in 1:num_lines){
        txt_line<-readLines(txt_con,n=1)
        if (length(grep(" PERCENT DISCREPANCY =",txt_line))) {
          pd<-c(pd,as.numeric(substr(txt_line,70,78)))
        }
      }
    )
    #Time took about 5 minutes

The inefficiencies in this approach arise from reading the file twice (first to get num_lines, then to step through each line looking for the desired text).

Is there a way to speed this process up through the use of ?scan? I wasn't able to get anything working, but what I had in mind was to scan through the file and, when the key phrase (e.g. " PERCENT DISCREPANCY = ") is encountered, read and store the next 13 characters (which will include some white space) as a numeric value, then resume the scan until the key phrase is encountered again, and repeat until the end-of-file marker is encountered. Is such an approach even possible, or is line-by-line the best bet?

http://r.789695.n4.nabble.com/file/n4632558/MCR.out MCR.out |
|
R may not be the best tool for this.
Did you look at gawk? It is also available for Windows:
http://gnuwin32.sourceforge.net/packages/gawk.htm

Once gawk has written a new file that only contains the lines / data you want, you could use R for the next steps.
You can also run gawk from within R with the system() command.

Rgds,
Rainer

On Wednesday 06 June 2012 09:54:15 emorway wrote:
> [...]
>
> --
> View this message in context: http://r.789695.n4.nabble.com/extracting-values-from-txt-file-that-follow-user-supplied-quote-tp4632558.html
> Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code. |
|
I think 1 GB is small enough that this can be easily and efficiently done in R. The key is: regular expressions are your friend.

I shall assume that the text file has been read into R as a single character string, named "mystring". The code below could easily be modified to work on a vector of strings if the file is read in line by line. Alternatively, paste(yourvec, collapse="") could be used to collapse such a vector into a single string, which would be necessary if, for instance, the key info could be broken up over several lines/elements of the vector.

Here is a small reproducible example of a way to do this, in which the keyword string to search for is "qdxpRxt" and the next 3 characters that follow it are what you wish to extract.

    ### Create the example
    test <- lapply(sample(1:20, 30, replace=TRUE), function(n)
        paste(sample(c(letters, LETTERS, rep(" ", 12)), n, replace=TRUE), collapse=""))
    mystring <- paste(lapply(test, function(x)
        paste(x, c("qdxpRxt"), floor(runif(1, 0, 1000)), sep="")), collapse="")

    ## Extract the strings of interest
    extracted <- strsplit(gsub(".*?qdxpRxt(.{3})", "\\1%&", mystring), "%&")

Note that this is not quite right yet. The extracted strings might include characters, not just numbers; the strings have to be converted from character to numeric; and I assumed that "%&" would not occur in the extracted strings and could be used as a split string. I leave the fixups to you.

Note that I make no claims about the efficiency of this approach vis-a-vis external tools like gawk, of which there are many, and any of several R string-handling packages might also do the job easily. I only wanted to point out that it appears to be straightforward in vanilla R, since R already has a regular expression engine built in.

Cheers,
Bert

On Wed, Jun 6, 2012 at 10:34 AM, Rainer Schuermann <[hidden email]> wrote:
> [...]

--
Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm |
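One possible way to finish the fix-ups mentioned above, offered as a sketch rather than as Bert's own solution: base R's gregexpr() and regmatches() can pull out the characters that follow the keyword directly, which avoids the need for a split string such as "%&". The sample string below is made up for illustration, and the lookbehind syntax assumes perl = TRUE.

```r
# Hypothetical test string: three occurrences of the keyword "qdxpRxt",
# each followed by the 3 characters we want (one of them padded with a blank).
mystring <- "abqdxpRxt123 cdqdxpRxt 45xyqdxpRxt678"

# (?<=qdxpRxt) is a lookbehind: match 3 characters only where they are
# immediately preceded by the keyword. Lookbehind requires perl = TRUE.
hits <- regmatches(mystring,
                   gregexpr("(?<=qdxpRxt).{3}", mystring, perl = TRUE))[[1]]

# Convert to numeric; pieces that are not numbers would become NA.
values <- suppressWarnings(as.numeric(hits))
values
```

Pieces that come back as NA are the "characters, not just numbers" case and can be filtered with !is.na(values) if that suits the data.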
|
Thanks for your suggestions. Bert, your response made me aware of "regular expressions". Are regular expressions the same across various languages? Consider the following line of text:

    txt_line<-" PERCENT DISCREPANCY = 0.01 PERCENT DISCREPANCY = -0.05"

It seems Python uses the following line of code to extract the two values in "txt_line" and store them in a variable called "v":

    v = re.findall("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?", line)
    #v[0] 0.01
    #v[1] -0.05

I tried something similar in R using the same regular expression, but got an error:

    edm<-grep("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?",txt_line)
    #Error: '\d' is an unrecognized escape in character string starting "[+-]? *(?:\d"

I'm not even sure which function in R most efficiently extracts the values from "txt_line". Basically, I want to peel out the values and think I can use the decimal point to construct the regular expression, but I don't know where to go from here. |
|
> -----Original Message-----
> From: [hidden email] [mailto:r-help-bounces@r-project.org] On Behalf Of emorway
> Sent: Thursday, June 07, 2012 10:41 AM
> To: [hidden email]
> Subject: [R] extracting values from txt with regular expression
>
> [...]

I am a regular expression novice, but the error message you are receiving is the result of not doubling the backslashes in your regular expression pattern. Inside an R string literal the backslash is itself an escape character, so each backslash meant for the regular expression has to be written as \\. So this will get you close to what you want (although not necessarily efficiently).

    ndx <- gregexpr("[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?", txt_line)
    matched <- regmatches(txt_line, ndx)
    matched

Hope this is helpful,

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204 |
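To make the escaping point concrete, here is a small sketch of the same idea carried through to numeric values. It is an illustration under stated assumptions, not a tuned solution: perl = TRUE is added here so that \d and (?:...) are handled by the PCRE engine, and the spacing inside txt_line is made up.

```r
txt_line <- " PERCENT DISCREPANCY =  0.01   PERCENT DISCREPANCY = -0.05"

# In R source code, "\\d" is a two-character string: a backslash and a "d".
# That is what the regex engine needs to receive for the digit class \d.
nchar("\\d")

num_pattern <- "[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?"

# gregexpr() locates every match; regmatches() extracts the matched text.
matched <- regmatches(txt_line, gregexpr(num_pattern, txt_line, perl = TRUE))

# regmatches() returns a list with one character vector per input string,
# so flatten it before converting to numeric.
values <- as.numeric(unlist(matched))
values
```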
|
In reply to this post by emorway
Hello,
I've just read your follow-up question on regular expressions, and I believe this, your original problem, can be made much faster. Just use readLines() differently, reading many lines at a time. For this to work you will still need to know the total number of lines in the file.

    fun <- function(con, pattern, nlines, n=5000L){
        if(is.character(con)){
            con <- file(con, open="rt")
            on.exit(close(con))
        }
        passes <- nlines %/% n
        remaining <- nlines %% n
        res <- NULL
        for(i in seq_len(passes)){
            txt <- readLines(con, n=n)
            res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
        }
        if(remaining){
            txt <- readLines(con, n=remaining)
            res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
        }
        res
    }

    url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
    pat <- "PERCENT DISCREPANCY ="
    num_lines <- 14405247L

    # your original
    txt_con<-file(description=url,open="r")
    pd <- NULL
    t1 <- system.time(
    for(i in 1:num_lines){
        txt_line<-readLines(txt_con,n=1)
        if (length(grep(pat,txt_line))) {
            pd<-c(pd,as.numeric(substr(txt_line,70,78)))
        }
    }
    )
    close(txt_con)

    # the function above, increased 'n'
    t2 <- system.time(pd2 <- fun(url, pat, num_lines, 100000L))

    all.equal(pd, pd2)
    [1] TRUE

    rbind(original=t1, fun=t2, ratio=t1/t2)
             user.self sys.self  elapsed user.child sys.child
    original    780.16   196.16 981.9100         NA        NA
    fun           0.10     0.04   3.2000         NA        NA
    ratio      7801.60  4904.00 306.8469         NA        NA

A factor of 300.

Hope this helps,

Rui Barradas

On 06-06-2012 17:54, emorway wrote:
> [...] |
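As a variation on the chunked reading shown above, the following sketch keeps reading until readLines() returns nothing, so the line count does not have to be known in advance. The function name read_chunks, its arguments, and the small test file are hypothetical and only serve the illustration; the fixed column positions are an assumption carried over from the thread.

```r
# Read a file in chunks of `n` lines until it is exhausted, so the total
# number of lines does not have to be counted beforehand.
# `read_chunks` and its arguments are hypothetical names for this sketch.
read_chunks <- function(path, pattern, from, to, n = 100000L) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  res <- list()
  repeat {
    txt <- readLines(con, n = n)
    if (length(txt) == 0L) break          # nothing left to read
    hit <- txt[grepl(pattern, txt, fixed = TRUE)]
    res[[length(res) + 1L]] <- as.numeric(substr(hit, from, to))
  }
  # Combine once at the end instead of growing a vector with c() per chunk
  unlist(res)
}

# Small made-up file to exercise the function
tmp <- tempfile(fileext = ".out")
writeLines(c("header line",
             "PERCENT DISCREPANCY =  0.01",
             "some other output",
             "PERCENT DISCREPANCY = -0.05"), tmp)
pd <- read_chunks(tmp, "PERCENT DISCREPANCY =", from = 22, to = 27, n = 2L)
pd
```

Collecting the per-chunk results in a list and calling unlist() once is a design choice to avoid repeatedly copying a growing vector; whether it matters in practice depends on how many chunks there are.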
|
In reply to this post by Nordlund, Dan (DSHS/RDA)
Hi Dan and Rui,

Thank you for the suggestions, both were very helpful. Rui's code was quite fast. There is one more thing I want to explore for my own edification, but first I need some help fixing the code below, which is a slight modification of Dan's suggestion. It'll no doubt be tough to beat the time in which Rui's code finished the task, but I'm willing to try. First, I need to fix the following, which 'peels' the wrong bit of text from "txt_line". Instead of extracting what it now does (shown below), can the code be modified to extract the values 0.01 and -0.05, and store them in the variable 'extracted'?

    txt_line<-" PERCENT DISCREPANCY = 0.01 PERCENT DISCREPANCY = -0.05"
    extracted <- strsplit(gsub("[+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?","\\1%&",txt_line),"%&")
    extracted
    #[1] " PERCENT DISCREPANCY = " " PERCENT DISCREPANCY = " |
|
Hello,
Just put the entire regexp in parentheses, so that it becomes capture group 1 and "\\1" has something to refer to.

    extracted <- strsplit(gsub("([+-]?(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?)","\\1%&",txt_line),"%&")
    extracted
    sapply(strsplit(unlist(extracted), "="), "[", 2)

As for speed, I believe that this might take longer. It will have to match a regular expression, then substitute, then split. A routine like the one I sent usually gives a speedup of an order of magnitude or more. I first wrote one around 20 years ago; I can now write it with my eyes closed and it consistently beats the alternatives, but there's no harm in trying. Or in combining strategies.

Good luck.

Rui Barradas

On 08-06-2012 04:52, emorway wrote:
> [...] |
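For comparison, here is a sketch that combines the capture group with the lazy ".*?" prefix from Bert's earlier example, so that the text before each number is discarded in the same gsub() call and no second split on "=" is needed. It is one possible variant, not the code anyone in the thread ran; perl = TRUE and the spacing inside txt_line are assumptions made here.

```r
txt_line <- " PERCENT DISCREPANCY =  0.01   PERCENT DISCREPANCY = -0.05"
num <- "[+-]?(?:\\d+\\.\\d*|\\.\\d+)(?:[eE][+-]?\\d+)?"

# ".*?" lazily swallows the text before each number; the parentheses around
# `num` form capture group 1, which "\\1" then puts back, followed by the
# split marker "%&" (assumed not to occur in the data).
marked <- gsub(paste0(".*?(", num, ")"), "\\1%&", txt_line, perl = TRUE)

# Split on the literal marker and convert the pieces to numeric.
extracted <- as.numeric(strsplit(marked, "%&", fixed = TRUE)[[1]])
extracted
```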
|
In reply to this post by emorway
On Thu, Jun 7, 2012 at 1:40 PM, emorway <[hidden email]> wrote:
> [...]

Try this. strapply applies the function (3rd argument) to each match of the regular expression (2nd argument), outputting the result of the function. The regular expression we have used matches a minus or digit followed by non-spaces. That seems good enough for this simple example but, of course, it can be changed.

    > library(gsubfn)
    > p <- "[-0-9]\\S+"
    > txt_line <- " PERCENT DISCREPANCY = 0.01 PERCENT DISCREPANCY = -0.05"
    >
    > strapply(txt_line, p, as.numeric)[[1]]
    [1]  0.01 -0.05

or using strapplyc (which is similar but uses c as the function and is optimized for speed):

    > as.numeric(strapplyc(txt_line, p)[[1]])
    [1]  0.01 -0.05

If we are only parsing a few lines then the speed does not matter, but if there are large amounts to parse then be sure to have the tcltk package installed to get the best speed from the gsubfn functions (on Windows and most but not all Linux systems tcltk is installed by default, but on a few you have to do it yourself). If you don't have tcltk, the gsubfn package will fall back to R code, which is slower. Also, as noted, strapplyc is faster than strapply. There are arguments and options that can override the defaults.

The gsubfn home page is at http://gsubfn.googlecode.com

Regular expressions are largely the same but not 100% identical across languages. There are some links to regular expression info in different languages at the bottom of the home page just listed. R can use R or Perl regular expressions, and the gsubfn functions can, in addition, use Tcl regular expressions.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com |
|
In reply to this post by emorway
On Wed, Jun 6, 2012 at 12:54 PM, emorway <[hidden email]> wrote:
> [...]

Try this:

    g <- function(url, string, from, to, ...) {
        L <- readLines(url)
        matched <- grep(string, L, value = TRUE, ...)
        as.numeric(substring(matched, from, to))
    }

    > url <- "http://r.789695.n4.nabble.com/file/n4632558/MCR.out"
    > g(url, "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)
    [1]    NA  0.00 -0.01    NA  0.00 -0.01

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com |
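The NA values in the output above suggest that, on some matching lines, columns 70 to 78 do not hold a number; that reading is an assumption, since it depends on the layout of the attached file. A sketch of an alternative that takes the number directly after the key phrase, wherever it sits on the line, follows. The helper extract_after and the sample lines are hypothetical, the pattern does not handle exponent notation, and the key is assumed to contain no regular expression metacharacters.

```r
# Extract the number that directly follows each occurrence of a key phrase.
# `extract_after` is a hypothetical helper, not part of any package.
extract_after <- function(lines, key = "PERCENT DISCREPANCY =") {
  # key phrase, optional blanks, then a signed decimal number
  pattern <- paste0(key, "[[:space:]]*[-+]?[0-9]*\\.?[0-9]+")
  hits <- unlist(regmatches(lines, gregexpr(pattern, lines)))
  # drop everything up to and including the "=" and the blanks after it
  as.numeric(sub(".*=[[:space:]]*", "", hits))
}

# Made-up lines: one without the key phrase, one with two values, one with one
L <- c("  VOLUME IN = 12.5",
       " PERCENT DISCREPANCY =   0.00   PERCENT DISCREPANCY =  -0.01",
       " PERCENT DISCREPANCY = 0.25")
extract_after(L)
```

Because the match is anchored on the key phrase, other numbers in the file (such as the 12.5 in the first sample line) are ignored.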
|
I'll summarize the results in terms of total run time for the suggestions that have been made as well as post the code for those that come across this post in the future. First the results (the code for which is provided second):
What I tried to do using suggestions from Bert and Dan:

    t1
    #   user  system elapsed
    # 208.21    1.68  210.34

Gabor's suggested code:

    t2
    #   user  system elapsed
    #  51.12    0.63   51.75

Rui's suggested code:

    t3a  #(get the number of lines)
    #   user  system elapsed
    #  45.13   11.08   56.23
    t3b  #(now perform the function)
    #   user  system elapsed
    #  50.59    0.55   51.16

So in summary, Gabor's and Rui's code have very similar run times if the number of lines in the file is known a priori (t2 is roughly equal to t3b). Gabor's code seems a little more robust, since it doesn't require the total number of lines in the file to be supplied.

And here is the code used to get these times (note that the file I used was the 1 GB text file, not the reduced version attached to the top post of this thread):

    #----------------
    #modified attempt
    #----------------
    library(gsubfn)
    library(tcltk2)
    p<-"[-0-9]\\S+"
    pd<-numeric()
    txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
    t1<-system.time(
      while (length(txt_line<-readLines(txt_con,n=1))){
        if (length(grep("DISCREPANCY = ",txt_line))) {
          pd<-c(pd,as.numeric(strapplyc(txt_line, p)[[1]]))
        }
      })
    close(txt_con)
    t1
    #   user  system elapsed
    # 208.21    1.68  210.34

    #----------------------------
    #Suggested by G. Grothendieck
    #----------------------------
    g<-function(txt_con, string, from, to, ...) {
      L <- readLines(txt_con)
      matched <- grep(string, L, value = TRUE, ...)
      as.numeric(substring(matched, from, to))
    }
    txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
    t2<-system.time(
      edm<-g(txt_con, "PERCENT DISCREPANCY = ", 70, 78, fixed = TRUE)
    )
    close(txt_con)
    t2
    #   user  system elapsed
    #  51.12    0.63   51.75

    #-------------------------
    #Suggested by Rui Barradas
    #-------------------------
    library(R.utils)
    t3a<-system.time(num_lines<-countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
    t3a
    #   user  system elapsed
    #  45.13   11.08   56.23

    fun <- function(con, pattern, nlines, n=5000L){
      if(is.character(con)){
        con <- file(con, open="rt")
        on.exit(close(con))
      }
      passes <- nlines %/% n
      remaining <- nlines %% n
      res <- NULL
      for(i in seq_len(passes)){
        txt <- readLines(con, n=n)
        res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
      }
      if(remaining){
        txt <- readLines(con, n=remaining)
        res <- c(res, as.numeric(substr(txt[grepl(pattern, txt)], 70, 78)))
      }
      res
    }
    txt_con<-file(description="D:/MCR_BeoPEST - Copy/MCR.out",open="r")
    pat<-"PERCENT DISCREPANCY ="
    num_lines <- 14405247L
    t3b <- system.time(pd2 <- fun(txt_con, pat, num_lines, 100000L))
    close(txt_con)
    t3b
    #   user  system elapsed
    #  50.59    0.55   51.16 |
