Efficient means to link two data frames

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Efficient means to link two data frames

Fisher Dennis
R 3.3.1
OS X

Colleagues,

I have two large data frames that I am trying to link efficiently.   A small example is as follows:

structure(list(Day = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
13L, 14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L), Value = c(76, 116, 111, 103, 114, 99, 128,
96, 81, 84, 81, 108, 109, 106, 125, 128, 92, 90, 83, 89, 76,
89, 101, 93, 98, 77, 92)), .Names = c("Day", "Value"), row.names = c(NA,
-27L), class = "data.frame")

which becomes:
  Day Value
1    1    76
2    2   116
3    3   111
4    5   103
5    6   114
6    7    99
7    8   128
8    9    96
9   10    81
10  11    84
11  13    81
12  14   108
13  16   109
14  17   106
15  18   125
16  19   128
17  20    92
18  21    90
19  22    83
20  23    89
21  24    76
22  25    89
23  26   101
24  27    93
25  28    98
26  29    77
27  30    92

The second dataframe is merely:
TESTDAYS <- data.frame(TestDay = c(4, 11, 15))

For each row in the second dataframe, I would like to identify the first row in the first dataframe in which Day is >= TestDay.
For example, for TestDay == 4, Day would equal 5.  I would then append the corresponding “Value” in the TestValue column
The result would be:
 TestDay TestValue
1       4       103
2      11        84
3      15       109

I can accomplish this with brute force but I suspect that there is some clever day to vectorize this.  Any help would be appreciated.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone / Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com <http://www.plessthan.com/>





        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Reply | Threaded
Open this post in threaded view
|

Re: Efficient means to link two data frames

Ben Tupper-2
Hi,

I have a solution based upon findInterval() which depends upon the ordered nature of the 'Day' column.  I can't speak to whether or not it is efficient but it is handy.  I love the findInterval() function but have often wished it works with look up tables in descending order.  This function, find_interval(), is my first reasonable pass at working with descending order.

https://gist.github.com/btupper/5fc6cc4e7d86f39f9e4f

## start
source("find_interval.R")

x <- structure(list(Day = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
13L, 14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L), Value = c(76, 116, 111, 103, 114, 99, 128,
96, 81, 84, 81, 108, 109, 106, 125, 128, 92, 90, 83, 89, 76,
89, 101, 93, 98, 77, 92)), .Names = c("Day", "Value"), row.names = c(NA,
-27L), class = "data.frame")

TESTDAYS <- data.frame(TestDay = c(4, 11, 15))

x <- x[rev(1:nrow(x)),]

ix <- find_interval(TESTDAYS[,'TestDay'], x[,'Day'])

TESTDAYS[,'TestValue'] <- x[ix,'Value']

TESTDAYS
#   TestDay TestValue
# 1       4       103
# 2      11        84
# 3      15       109

### end

Will that do the trick with your large dataset?

Ben


> On Sep 29, 2016, at 9:38 AM, Dennis Fisher <[hidden email]> wrote:
>
> R 3.3.1
> OS X
>
> Colleagues,
>
> I have two large data frames that I am trying to link efficiently.   A small example is as follows:
>
> structure(list(Day = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
> 13L, 14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
> 27L, 28L, 29L, 30L), Value = c(76, 116, 111, 103, 114, 99, 128,
> 96, 81, 84, 81, 108, 109, 106, 125, 128, 92, 90, 83, 89, 76,
> 89, 101, 93, 98, 77, 92)), .Names = c("Day", "Value"), row.names = c(NA,
> -27L), class = "data.frame")
>
> which becomes:
>  Day Value
> 1    1    76
> 2    2   116
> 3    3   111
> 4    5   103
> 5    6   114
> 6    7    99
> 7    8   128
> 8    9    96
> 9   10    81
> 10  11    84
> 11  13    81
> 12  14   108
> 13  16   109
> 14  17   106
> 15  18   125
> 16  19   128
> 17  20    92
> 18  21    90
> 19  22    83
> 20  23    89
> 21  24    76
> 22  25    89
> 23  26   101
> 24  27    93
> 25  28    98
> 26  29    77
> 27  30    92
>
> The second dataframe is merely:
> TESTDAYS <- data.frame(TestDay = c(4, 11, 15))
>
> For each row in the second dataframe, I would like to identify the first row in the first dataframe in which Day is >= TestDay.
> For example, for TestDay == 4, Day would equal 5.  I would then append the corresponding “Value” in the TestValue column
> The result would be:
> TestDay TestValue
> 1       4       103
> 2      11        84
> 3      15       109
>
> I can accomplish this with brute force but I suspect that there is some clever day to vectorize this.  Any help would be appreciated.
>
> Dennis
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone / Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com <http://www.plessthan.com/>
>
>
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> [hidden email] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.