# Complicated analysis for huge databases

12 messages
Open this post in threaded view
|

## Complicated analysis for huge databases

 Hi all .., I have a large dataset of around 600,000 rows and 600 columns. The first col is codes for Meal A, the second columns is codes for Meal B. The third column is customers IDs where each customer had a combination of meals. Each column of the rest columns contains values 0,1,or 2. The dataset is organised in a way so that the first group of customers had similar meals combinations, this is followed by another group of customers with similar meals combinations but different from the first group and so on. The dataset looks like this :- > MyData        Meal A     Meal B     Cust.ID      I            II        III     IV   ...... 600 1    33                 55             1             0           1        2       0 2    33                 55              3             1          0        2        2 3    33                 55              5             2          1        1         2 4    44                 66               7            0          2         2        2 5   44                  66               4            1          1          0       1 6   44                  66                9            2          0          1       2 . . 600,000 I wanted to find maf() for each column(from 4 to 600) after calculating the frequency of the 3 values (0,1,2) but this should be done group by group (i.e. group(33-55) : rows 1:3 then group(44-66) :rows 4:6 and so on). I can do the analysis  for the entire column but not group by group like this : MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1))) How can I modify this code to tell R to do the analysis group by group for each column so I get maf value for 33-55 group of clolumn I, then maf value for group 44-66 in the same column I,then the rest of groups in this column and do the same for the remaining columns. In fact, I'm interested in doing this analysis for only 300 columns but all of the 600 columns. I have another sheet contains names of columns of interest like this : >ColOfinterest Col I IV V . . 300 Any one would help with the best combination of syntax to perform this complex analysis? Regards Allaisone         [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Complicated analysis for huge databases

 Combine columns 1 and 2 into a column with a single ID like "33.55", "44.66" and use split() on these IDs to break up your dataset. Iterate over the list of data frames split() returns. B. > On Nov 17, 2017, at 12:59 PM, Allaisone 1 <[hidden email]> wrote: > > > Hi all .., > > > I have a large dataset of around 600,000 rows and 600 columns. The first col is codes for Meal A, the second columns is codes for Meal B. The third column is customers IDs where each customer had a combination of meals. Each column of the rest columns contains values 0,1,or 2. The dataset is organised in a way so that the first group of customers had similar meals combinations, this is followed by another group of customers with similar meals combinations but different from the first group and so on. The dataset looks like this :- > > >> MyData > >       Meal A     Meal B     Cust.ID      I            II        III     IV   ...... 600 > > 1    33                 55             1             0           1        2       0 > > 2    33                 55              3             1          0        2        2 > > 3    33                 55              5             2          1        1         2 > > 4    44                 66               7            0          2         2        2 > > 5   44                  66               4            1          1          0       1 > > 6   44                  66                9            2          0          1       2 > > . > > . > > 600,000 > > > > I wanted to find maf() for each column(from 4 to 600) after calculating the frequency of the 3 values (0,1,2) but this should be done group by group (i.e. group(33-55) : rows 1:3 then group(44-66) :rows 4:6 and so on). > > > I can do the analysis  for the entire column but not group by group like this : > > > MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1))) > > How can I modify this code to tell R to do the analysis group by group for each column so I get maf value for 33-55 group of clolumn I, then maf value for group 44-66 in the same column I,then the rest of groups in this column and do the same for the remaining columns. > > In fact, I'm interested in doing this analysis for only 300 columns but all of the 600 columns. > I have another sheet contains names of columns of interest like this : > >> ColOfinterest > > Col > I > IV > V > . > . > 300 > > Any one would help with the best combination of syntax to perform this complex analysis? > > Regards > Allaisone > > > > > > > > [[alternative HTML version deleted]] > > ______________________________________________ > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Complicated analysis for huge databases

 Thanks Boris , this was very helpful but I'm struggling with the last part. 1) I combined the first 2 columns :- library(tidyr) SingleMealsCode <-unite(MyData, MealsCombinations, c(MealA, MealB), remove=FALSE) SingleMealsCode <- SingleMealsCode[,-2]   2) I separated this dataframe into different dataframes based on "MealsCombination"    column so R will recognize each meal combination separately : SeparatedGroupsofmealsCombs <- split(SingleMealCode,SingleMealCode\$MealsCombinations) after investigating the structure of "SeparatedGroupsofmealsCombs" , I can see a list of different databases, each of which represents a different Meal combinations which is great. No, I'm struggling with the last part, how can I run the maf code for all dataframes? when I run this code as before :- maf <- apply(SeparatedGroupsofmealsCombs, 2, function(x)maf(tabulate(x+1))) an error message says : dim(X) must have a positive length . I'm not sure which length I need to specify.. any suggestions to correct this syntax ? Regards Allaisone ________________________________ From: Boris Steipe <[hidden email]> Sent: 17 November 2017 21:12:06 To: Allaisone 1 Cc: R-help Subject: Re: [R] Complicated analysis for huge databases Combine columns 1 and 2 into a column with a single ID like "33.55", "44.66" and use split() on these IDs to break up your dataset. Iterate over the list of data frames split() returns. B. > On Nov 17, 2017, at 12:59 PM, Allaisone 1 <[hidden email]> wrote: > > > Hi all .., > > > I have a large dataset of around 600,000 rows and 600 columns. The first col is codes for Meal A, the second columns is codes for Meal B. The third column is customers IDs where each customer had a combination of meals. Each column of the rest columns contains values 0,1,or 2. The dataset is organised in a way so that the first group of customers had similar meals combinations, this is followed by another group of customers with similar meals combinations but different from the first group and so on. The dataset looks like this :- > > >> MyData > >       Meal A     Meal B     Cust.ID      I            II        III     IV   ...... 600 > > 1    33                 55             1             0           1        2       0 > > 2    33                 55              3             1          0        2        2 > > 3    33                 55              5             2          1        1         2 > > 4    44                 66               7            0          2         2        2 > > 5   44                  66               4            1          1          0       1 > > 6   44                  66                9            2          0          1       2 > > . > > . > > 600,000 > > > > I wanted to find maf() for each column(from 4 to 600) after calculating the frequency of the 3 values (0,1,2) but this should be done group by group (i.e. group(33-55) : rows 1:3 then group(44-66) :rows 4:6 and so on). > > > I can do the analysis  for the entire column but not group by group like this : > > > MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1))) > > How can I modify this code to tell R to do the analysis group by group for each column so I get maf value for 33-55 group of clolumn I, then maf value for group 44-66 in the same column I,then the rest of groups in this column and do the same for the remaining columns. > > In fact, I'm interested in doing this analysis for only 300 columns but all of the 600 columns. > I have another sheet contains names of columns of interest like this : > >> ColOfinterest > > Col > I > IV > V > . > . > 300 > > Any one would help with the best combination of syntax to perform this complex analysis? > > Regards > Allaisone > > > > > > > >        [[alternative HTML version deleted]] > > ______________________________________________ > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.         [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Complicated analysis for huge databases

 Something like the following? AllMAFs <- list() for (i in length(SeparatedGroupsofmealsCombs) {   AllMAFs[[i]] <- apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x)maf(tabulate(x+1))) } (untested, of course) Also the solution is a bit generic since I don't know what the output of maf() looks like in your case, and I don't understand why you use tabulate because I would have assumed that's what maf() does - but that's not for me to worry about :-) B. > On Nov 17, 2017, at 7:15 PM, Allaisone 1 <[hidden email]> wrote: > > > Thanks Boris , this was very helpful but I'm struggling with the last part. > > 1) I combined the first 2 columns :- > >   > library(tidyr) > SingleMealsCode <-unite(MyData, MealsCombinations, c(MealA, MealB), remove=FALSE) > SingleMealsCode <- SingleMealsCode[,-2] > >   2) I separated this dataframe into different dataframes based on "MealsCombination" >    column so R will recognize each meal combination separately : > > SeparatedGroupsofmealsCombs <- split(SingleMealCode,SingleMealCode\$MealsCombinations) > > after investigating the structure of "SeparatedGroupsofmealsCombs" , I can see > a list of different databases, each of which represents a different Meal combinations which is great. > > No, I'm struggling with the last part, how can I run the maf code for all dataframes? > > when I run this code as before :- > > maf <- apply(SeparatedGroupsofmealsCombs, 2, function(x)maf(tabulate(x+1))) > > an error message says : dim(X) must have a positive length . I'm not sure which length > I need to specify.. any suggestions to correct this syntax ?   > > Regards > Allaisone > From: Boris Steipe <[hidden email]> > Sent: 17 November 2017 21:12:06 > To: Allaisone 1 > Cc: R-help > Subject: Re: [R] Complicated analysis for huge databases >   > Combine columns 1 and 2 into a column with a single ID like "33.55", "44.66" and use split() on these IDs to break up your dataset. Iterate over the list of data frames split() returns. > > > B. > > > On Nov 17, 2017, at 12:59 PM, Allaisone 1 <[hidden email]> wrote: > > > > > > Hi all .., > > > > > > I have a large dataset of around 600,000 rows and 600 columns. The first col is codes for Meal A, the second columns is codes for Meal B. The third column is customers IDs where each customer had a combination of meals. Each column of the rest columns contains values 0,1,or 2. The dataset is organised in a way so that the first group of customers had similar meals combinations, this is followed by another group of customers with similar meals combinations but different from the first group and so on. The dataset looks like this :- > > > > > >> MyData > > > >       Meal A     Meal B     Cust.ID      I            II        III     IV   ...... 600 > > > > 1    33                 55             1             0           1        2       0 > > > > 2    33                 55              3             1          0        2        2 > > > > 3    33                 55              5             2          1        1         2 > > > > 4    44                 66               7            0          2         2        2 > > > > 5   44                  66               4            1          1          0       1 > > > > 6   44                  66                9            2          0          1       2 > > > > . > > > > . > > > > 600,000 > > > > > > > > I wanted to find maf() for each column(from 4 to 600) after calculating the frequency of the 3 values (0,1,2) but this should be done group by group (i.e. group(33-55) : rows 1:3 then group(44-66) :rows 4:6 and so on). > > > > > > I can do the analysis  for the entire column but not group by group like this : > > > > > > MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1))) > > > > How can I modify this code to tell R to do the analysis group by group for each column so I get maf value for 33-55 group of clolumn I, then maf value for group 44-66 in the same column I,then the rest of groups in this column and do the same for the remaining columns. > > > > In fact, I'm interested in doing this analysis for only 300 columns but all of the 600 columns. > > I have another sheet contains names of columns of interest like this : > > > >> ColOfinterest > > > > Col > > I > > IV > > V > > . > > . > > 300 > > > > Any one would help with the best combination of syntax to perform this complex analysis? > > > > Regards > > Allaisone > > > > > > > > > > > > > > > >        [[alternative HTML version deleted]] > > > > ______________________________________________ > > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html> > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Complicated analysis for huge databases

 In reply to this post by Boris Steipe Or do it at one go using ?tapply and friends Bert On Nov 17, 2017 1:12 PM, "Boris Steipe" <[hidden email]> wrote: > Combine columns 1 and 2 into a column with a single ID like "33.55", > "44.66" and use split() on these IDs to break up your dataset. Iterate over > the list of data frames split() returns. > > > B. > > > On Nov 17, 2017, at 12:59 PM, Allaisone 1 <[hidden email]> > wrote: > > > > > > Hi all .., > > > > > > I have a large dataset of around 600,000 rows and 600 columns. The first > col is codes for Meal A, the second columns is codes for Meal B. The third > column is customers IDs where each customer had a combination of meals. > Each column of the rest columns contains values 0,1,or 2. The dataset is > organised in a way so that the first group of customers had similar meals > combinations, this is followed by another group of customers with similar > meals combinations but different from the first group and so on. The > dataset looks like this :- > > > > > >> MyData > > > >       Meal A     Meal B     Cust.ID      I            II        III >  IV   ...... 600 > > > > 1    33                 55             1             0           1 >   2       0 > > > > 2    33                 55              3             1          0 >   2        2 > > > > 3    33                 55              5             2          1 >   1         2 > > > > 4    44                 66               7            0          2 >    2        2 > > > > 5   44                  66               4            1          1 >     0       1 > > > > 6   44                  66                9            2          0 >     1       2 > > > > . > > > > . > > > > 600,000 > > > > > > > > I wanted to find maf() for each column(from 4 to 600) after calculating > the frequency of the 3 values (0,1,2) but this should be done group by > group (i.e. group(33-55) : rows 1:3 then group(44-66) :rows 4:6 and so on). > > > > > > I can do the analysis  for the entire column but not group by group like > this : > > > > > > MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1))) > > > > How can I modify this code to tell R to do the analysis group by group > for each column so I get maf value for 33-55 group of clolumn I, then maf > value for group 44-66 in the same column I,then the rest of groups in this > column and do the same for the remaining columns. > > > > In fact, I'm interested in doing this analysis for only 300 columns but > all of the 600 columns. > > I have another sheet contains names of columns of interest like this : > > > >> ColOfinterest > > > > Col > > I > > IV > > V > > . > > . > > 300 > > > > Any one would help with the best combination of syntax to perform this > complex analysis? > > > > Regards > > Allaisone > > > > > > > > > > > > > > > >       [[alternative HTML version deleted]] > > > > ______________________________________________ > > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help> > PLEASE do read the posting guide http://www.R-project.org/> posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide http://www.R-project.org/> posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >         [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-helpPLEASE do read the posting guide http://www.R-project.org/posting-guide.htmland provide commented, minimal, self-contained, reproducible code.
Open this post in threaded view
|

## Re: Complicated analysis for huge databases

Open this post in threaded view
|

## Re: Complicated analysis for huge databases

Open this post in threaded view
|

## Re: Complicated analysis for huge databases

Open this post in threaded view
|

## Re: Complicated analysis for huge databases

Open this post in threaded view
|