Parallel Scan of Large File

Parallel Scan of Large File

Ryan Garner
Is it possible to scan a large file into a character vector in parallel, in 1M-record chunks, using scan() with the "doMC" package? Furthermore, can I specify the task each child performs?

i.e. I'm working on a Linux box with 8 cores and would like to scan in 8M records at a time (each of the 8 cores scanning 1M records at a time) from a file with 40M records total.

library(doMC)
registerDoMC(cores = 8)

file <- file("data.txt", "r")
child <- foreach(i = icount(40)) %dopar%
{
    # each child skips to its own 1M-record block
    scan(file, what = "character", sep = "\n", skip = (i - 1) * 1e6, nlines = 1e6)
}

Thus, each child would have a different skip argument: child[[1]]: skip = 0, child[[2]]: skip = 1e6, child[[3]]: skip = 2e6, ..., child[[40]]: skip = 39e6. I would then end up with a list of 40 vectors, with child[[1]] containing records 1 to 1000000, child[[2]] containing records 1000001 to 2000000, ..., child[[40]] containing records 39000001 to 40000000.

Also, would one file connection suffice, or does there need to be a separate connection that is opened and closed for each child?




Re: Parallel Scan of Large File

Mike Marchywka

I can't comment on R approaches, but if your rationale here is speed
and you hope to scale this up to bigger files, I would suggest more
analysis or measurement. In the case you outline, disk IO is probably
going to be the rate-limiting step. It usually helps if you can make
things predictable so the disk and memory caches can be used efficiently.
If you split up disk IO among different threads, there is no reasonable
way the hardware can figure out which access is likely to come next.
Further, things like skip are often implemented as dummy reads
on sequential file access calls.

If you pursue this, I'd be curious to see what kind of results you get as
you go from 1 to 8 cores with larger files.

You would probably be better off if you could find a way to pipeline this
work rather than split it up as you suggest. The idea sounds good, of course:
you end up with 8 cores looking at your text. But you could easily be limited
by some other resource, like bus bandwidth to disk or memory. And as each core
gets a bigger chunk, eventually you run out of physical memory, and then of
course you are just doing disk IO for virtual memory.
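
Roughly, and with the caveat that R is not my area, the pipeline shape might look something like the sketch below. It uses the parallel package's mcparallel()/mccollect() (my assumption about tooling, not something discussed here), and process_chunk() is a hypothetical stand-in for whatever per-record work you actually need:

library(parallel)

process_chunk <- function(x) nchar(x)       # hypothetical placeholder for the real work

con  <- file("data.txt", "r")
jobs <- list()
repeat {
    chunk <- readLines(con, n = 1e6)        # sequential read, done only by the master
    if (length(chunk) == 0) break
    jobs[[length(jobs) + 1]] <- mcparallel(process_chunk(chunk))  # fork the CPU-bound work
    if (length(jobs) >= 8) {                # keep at most 8 chunks in flight
        results <- mccollect(jobs, wait = TRUE)
        ## ... combine `results` into the final answer here ...
        jobs <- list()
    }
}
if (length(jobs) > 0) results <- mccollect(jobs, wait = TRUE)
close(con)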




Re: Parallel Scan of Large File

jholtman
You might be better off partitioning the file before processing it with
R.  If you are planning on using skip = n to skip over records before
processing, then the last thread you start would have to read through 7/8
of the file before it even begins.  That I/O, plus scanning for the line
feeds that mark the ends of records, might take as much time as just
processing the file in a single thread, depending on the complexity of
the analysis.  If you expect to make multiple passes over the file
(especially while debugging the code), then I would recommend that you
split the file once into the sizes you want each thread to process, and
then use those partitions in your parallel processing, so that you don't
waste a lot of I/O just reading to find a line feed.  Also, if your
system has a number of I/O devices, you could put each partition on a
separate spindle and potentially reduce the bottleneck of having eight
threads read from the same disk, which causes a lot of head-seeking
activity.
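
As a rough illustration of that approach (the partition prefix and the 1M-line chunk size are only examples), you would split once from the shell and then have each worker read only its own partition, with no skipping at all:

## one-time split into 1,000,000-line partitions: part_00, part_01, ...
system("split -l 1000000 -d data.txt part_")

library(doMC)
registerDoMC(cores = 8)

parts  <- Sys.glob("part_*")
master <- foreach(f = parts) %dopar% {
    scan(f, what = "character", sep = "\n", quiet = TRUE)  # each worker reads one whole partition
}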




--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


Re: Parallel Scan of Large File

Ryan Garner
Hi Jim,

Thanks for your insight. I used the Linux split command to break my large file into smaller partitions. On the server I work on, multipath I/O is enabled and we use RAID for storage, so I don't think I can put each partition on a separate spindle. I am able to process multiple files at a time from the command line:

> cat file1.txt | wc -l & # Cpu 1
> cat file2.txt | wc -l & # Cpu 2
> cat file3.txt | wc -l & # Cpu 3
> cat file4.txt | wc -l & # Cpu 4

But I'm still not sure how to read each partition in parallel. When I run this code, it doesn't run in parallel; instead, one CPU does all the work.

R> library(doMC)
R> registerDoMC(cores = 4)
R> files <- Sys.glob("x*")                                   # all the file partitions created by split
R> file.list <- lapply(files, function(x) file(x, "r"))      # open a connection to each partition
R> master <- foreach(i = icount(length(file.list))) %dopar%  # attempt at a parallel readLines
+ {
+     readLines(file.list[[i]], 1000000)
+ }
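
Would a variant like the one below, where each worker opens and closes its own connection instead of sharing the connections created in the master, be the better approach? (Just a sketch; I'm only guessing that the shared connections are part of the problem.)

R> library(doMC)
R> registerDoMC(cores = 4)
R> files <- Sys.glob("x*")                  # partitions created by split
R> master <- foreach(f = files) %dopar%
+ {
+     con <- file(f, "r")                   # open the connection inside the worker
+     out <- readLines(con, 1000000)
+     close(con)
+     out
+ }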

Re: Parallel Scan of Large File

Luedde, Mirko
In reply to this post by Ryan Garner
Hi Ryan,

the "Getting Started with doMC and foreach" manual
tells me that you might have forgotten a

  registerDoMC()

in the example you provided.

Best, Mirko
