Efficiently reading random lines from a large file

Pablo Lewinger
I need to read two different random lines at a time from a large
ASCII file (120 x 296976) containing space delimited 0-1 entries.

The following code does the job and it's reasonably fast for my needs:

   lineNumber = sample(120, 2)
   line1 = scan(filename, what = "integer", skip=lineNumber[1]-1, nlines=1)
   line2 = scan(filename, what = "integer", skip=lineNumber[2]-1, nlines=1)

 > system.time(for (i in 50){
+   lineNumber = sample(120, 2)
+   line1 = scan(filename, what = "integer", skip=lineNumber[1]-1, nlines=1)
+   line2 = scan(filename, what = "integer", skip=lineNumber[2]-1, nlines=1)
+ })

Read 296976 items
Read 296976 items
[1] 14.24  0.12 14.51    NA    NA

However, I'm wondering if there's an even faster way to do this. Is there?

 > sessionInfo()
R version 2.4.1 (2006-12-18)
i386-pc-mingw32

Juan Pablo Lewinger
Department of Preventive Medicine
Keck School of Medicine
University of Southern California
1540 Alcazar Street, CHP-220
Los Angeles, CA 90089-9011, USA

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: Efficiently reading random lines from a large file

Marc Schwartz
On Tue, 2007-05-15 at 16:02 -0700, Juan Pablo Lewinger wrote:

> I need to read two different random lines at a time from a large
> ASCII file (120 x 296976) containing space delimited 0-1 entries.
>
> The following code does the job and it's reasonably fast for my needs:
>
>    lineNumber = sample(120, 2)
>    line1 = scan(filename, what = "integer", skip=lineNumber[1]-1, nlines=1)
>    line2 = scan(filename, what = "integer", skip=lineNumber[2]-1, nlines=1)
>
>  > system.time(for (i in 50){
> +   lineNumber = sample(120, 2)
> +   line1 = scan(filename, what = "integer", skip=lineNumber[1]-1, nlines=1)
> +   line2 = scan(filename, what = "integer", skip=lineNumber[2]-1, nlines=1)
> + })
>
> Read 296976 items
> Read 296976 items
> [1] 14.24  0.12 14.51    NA    NA
>
> However, I'm wondering if there's an even faster way to do this. Is there?

You might want to take a look at this post by Jim Holtman from earlier
in the year for some ideas:

http://tolstoy.newcastle.edu.au/R/e2/help/07/02/9709.html
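
Independently of that post, here is a minimal sketch of one option: read the
file once as text, keep the lines in memory, and parse only the two sampled
lines on each draw. It is not taken from the linked post. The names
parse_line and read_two_random are hypothetical, and a small temporary file
stands in for your 120 x 296976 file. It assumes the whole file fits in
memory as text (each real line is roughly 594 KB, so about 70 MB for 120
lines). Two side notes on the quoted code: `for (i in 50)` runs the body
once with i = 50, so the timing above covers a single pair of reads (use
`1:50` to time 50 draws), and `what = "integer"` makes scan() return
character strings, whereas `what = integer()` returns integers.

```r
# Small stand-in for the real file: 120 rows of space-delimited 0/1 entries.
set.seed(1)
n_rows <- 120
n_cols <- 50                      # the real file has 296976 columns
demo <- matrix(sample(0:1, n_rows * n_cols, replace = TRUE), nrow = n_rows)
filename <- tempfile(fileext = ".txt")
write.table(demo, filename, row.names = FALSE, col.names = FALSE)

# Read every line once as raw text; nothing is parsed at this point.
all_lines <- readLines(filename)

# Hypothetical helper: turn one stored line into an integer vector.
parse_line <- function(txt) {
  as.integer(strsplit(txt, " ", fixed = TRUE)[[1]])
}

# Hypothetical helper: draw two distinct line numbers and parse those lines.
read_two_random <- function(lines) {
  idx <- sample(length(lines), 2)
  list(rows  = idx,
       line1 = parse_line(lines[idx[1]]),
       line2 = parse_line(lines[idx[2]]))
}

res <- read_two_random(all_lines)
```

After the one-time readLines() call, each draw costs only the parsing of two
lines, since the file is not reopened and no lines are skipped. If memory is
tight, parsing everything once into an integer matrix and indexing rows is a
variation on the same idea, at the cost of a larger up-front read.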

HTH,

Marc Schwartz
