Efficient Merging of two huge sorted data frames?---Use merge()?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Efficient Merging of two huge sorted data frames?---Use merge()?

Charles Cheung
Hello all,

A problem I encounter today is the speed which takes to sort two huge data
frames...

I wish to sort by (X,Y)

Dataframe One consists of variables:
X, Y, sequence, position
having ~700 000 records

another dataframe consists of
X,Y, intensities
having ~900 000 records


Every (X,Y) pair in dataframe One is included in dataframe Two,
however,  the reverse is not true.
Furthermore,  (X,Y, position) in data frame One makes the record unique.
(That means there can be multiple records with the same (X,Y) records!)

Added together, it makes it hard to just combine the two data frames
together by simply going
data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond
even in sorted records by X and Y.


Intuitive, it should only require very little time <O(n) complexity> after
the data records are sorted.
However, it takes so long (I haven't finished the process in 20 minutes.. it
should only take <1 min) to merge the list by X and Y using

merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect
this process is not optimized for already sorted list.

* assuming the two frames have been sorted, I would be able to do the
following:


X Y seq Pos
1 1   AA  32
1 2   AG  44
1 3   GC  65


X Y intensities
1 1  0.4
1 3  0.552

>>Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor
>>moves to (1,2) (1,3)  --> can't find..     cursor moves to (1,3) (1,3) ..
>>merge that pair

Is the merge function doing that already?


Is there an efficient way to merge the data frames? (What do you suggest I
should do?)


(to produce)
X Y seq pos intensities
1 1 AA   32     0.4
1 3 GC  65     0.552

Thank you in advance!


Charles Cheung

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Reply | Threaded
Open this post in threaded view
|

Re: Efficient Merging of two huge sorted data frames?---Use merge()?

Prof Brian Ripley
merge() is not optimized for large data frames.  To do things on this
scale you really want to be using a DBMS not R.  See the `R Data
Import/Export Manual'.

Sorting is not really relevant, especially as merge is not assuming that
the match is unique.  Hashing could be used, but is not.

As R is open source, you have the source code and it would be kinder to
read it yourself rather than expect this list to read it for you.  A
useful contribution to the R project would be to contribute a more
efficient version, and we look forwards to seeing your contribution.

On Mon, 8 May 2006, Charles Cheung wrote:

> Hello all,
>
> A problem I encounter today is the speed which takes to sort two huge data
> frames...
>
> I wish to sort by (X,Y)
>
> Dataframe One consists of variables:
> X, Y, sequence, position
> having ~700 000 records
>
> another dataframe consists of
> X,Y, intensities
> having ~900 000 records
>
>
> Every (X,Y) pair in dataframe One is included in dataframe Two,
> however,  the reverse is not true.
> Furthermore,  (X,Y, position) in data frame One makes the record unique.
> (That means there can be multiple records with the same (X,Y) records!)
>
> Added together, it makes it hard to just combine the two data frames
> together by simply going
> data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond
> even in sorted records by X and Y.
>
>
> Intuitive, it should only require very little time <O(n) complexity> after
> the data records are sorted.
> However, it takes so long (I haven't finished the process in 20 minutes.. it
> should only take <1 min) to merge the list by X and Y using
>
> merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect
> this process is not optimized for already sorted list.
>
> * assuming the two frames have been sorted, I would be able to do the
> following:
>
>
> X Y seq Pos
> 1 1   AA  32
> 1 2   AG  44
> 1 3   GC  65
>
>
> X Y intensities
> 1 1  0.4
> 1 3  0.552
>
>>> Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor
>>> moves to (1,2) (1,3)  --> can't find..     cursor moves to (1,3) (1,3) ..
>>> merge that pair
>
> Is the merge function doing that already?
>
>
> Is there an efficient way to merge the data frames? (What do you suggest I
> should do?)
>
>
> (to produce)
> X Y seq pos intensities
> 1 1 AA   32     0.4
> 1 3 GC  65     0.552
>
> Thank you in advance!
>
>
> Charles Cheung
>
> ______________________________________________
> [hidden email] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>

--
Brian D. Ripley,                  [hidden email]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel