Complicated analysis for huge databases


Complicated analysis for huge databases

Allaisone 1

Hi all ..,


I have a large dataset of around 600,000 rows and 600 columns. The first column holds codes for Meal A and the second column holds codes for Meal B. The third column holds customer IDs, where each customer had one combination of meals. Each of the remaining columns contains the values 0, 1, or 2. The dataset is organised so that the first group of customers shares one meal combination, followed by another group of customers sharing a different combination, and so on. The dataset looks like this:


> MyData

     Meal A   Meal B   Cust.ID   I   II   III   IV   ...... 600
1    33       55       1         0   1    2     0
2    33       55       3         1   0    2     2
3    33       55       5         2   1    1     2
4    44       66       7         0   2    2     2
5    44       66       4         1   1    0     1
6    44       66       9         2   0    1     2
.
.
600,000



I want to compute maf() for each column (from 4 to 600) after calculating the frequency of the three values (0, 1, 2), but this should be done group by group (i.e. group 33-55: rows 1:3, then group 44-66: rows 4:6, and so on).


I can do the analysis for an entire column, but not group by group, like this:


MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1)))

How can I modify this code to tell R to do the analysis group by group for each column, so that I get the maf value for the 33-55 group of column I, then the maf value for the 44-66 group of the same column I, then the rest of the groups in this column, and likewise for the remaining columns?

In fact, I'm interested in doing this analysis for only 300 columns, not all of the 600 columns.
I have another sheet that contains the names of the columns of interest, like this:

>ColOfinterest

Col
I
IV
V
.
.
300

Could anyone help with the best combination of syntax to perform this complex analysis?

Regards
Allaisone







        [[alternative HTML version deleted]]

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Complicated analysis for huge databases

Boris Steipe
Combine columns 1 and 2 into a column with a single ID like "33.55", "44.66" and use split() on these IDs to break up your dataset. Iterate over the list of data frames split() returns.


B.
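
A minimal sketch of this suggestion, on a toy data frame shaped like the one in the original post. The column names `MealA` and `MealB` and all values are invented for illustration; `paste()` builds the combined ID and `split()` returns one data frame per ID.

```r
# Toy data shaped like the poster's table (values invented for illustration).
MyData <- data.frame(
  MealA   = c(33, 33, 33, 44, 44, 44),
  MealB   = c(55, 55, 55, 66, 66, 66),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I       = c(0, 1, 2, 0, 1, 2),
  II      = c(1, 0, 1, 2, 1, 0),
  III     = c(2, 2, 1, 2, 0, 1),
  IV      = c(0, 2, 2, 2, 1, 2)
)

# One ID per meal combination, e.g. "33.55".
group_id <- paste(MyData$MealA, MyData$MealB, sep = ".")

# split() returns a named list with one data frame per combination.
groups <- split(MyData, group_id)

names(groups)         # the combination IDs
sapply(groups, nrow)  # number of customers in each group
```

Each element of `groups` can then be processed on its own, for example with `lapply()` or a `for` loop over `seq_along(groups)`.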

> On Nov 17, 2017, at 12:59 PM, Allaisone 1 <[hidden email]> wrote:

Re: Complicated analysis for huge databases

Allaisone 1

Thanks Boris, this was very helpful, but I'm struggling with the last part.

1) I combined the first 2 columns:


library(tidyr)
SingleMealsCode <-unite(MyData, MealsCombinations, c(MealA, MealB), remove=FALSE)
SingleMealsCode <- SingleMealsCode[,-2]

  2) I separated this data frame into different data frames based on the "MealsCombinations"
   column so R will recognize each meal combination separately:

SeparatedGroupsofmealsCombs <- split(SingleMealsCode, SingleMealsCode$MealsCombinations)

After investigating the structure of "SeparatedGroupsofmealsCombs", I can see
a list of different data frames, each of which represents a different meal combination, which is great.

Now I'm struggling with the last part: how can I run the maf code for all data frames?

when I run this code as before :-

maf <- apply(SeparatedGroupsofmealsCombs, 2, function(x)maf(tabulate(x+1)))

An error message says: dim(X) must have a positive length. I'm not sure which length
I need to specify. Any suggestions to correct this syntax?
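
The message arises because `apply()` needs an object with dimensions (a matrix or a data frame), whereas `split()` returns a plain list, whose `dim()` is `NULL`. The sketch below, on an invented two-element list, reproduces the message and shows `lapply()` visiting each data frame instead. `sum` is only a placeholder for the real per-column function.

```r
# Invented stand-in for the list returned by split().
groups <- list(
  a = data.frame(x = c(0, 1, 2)),
  b = data.frame(x = c(2, 2, 1))
)

dim(groups)  # NULL: a plain list has no dimensions

# apply() on the list fails; capture the message instead of stopping.
msg <- tryCatch(
  apply(groups, 2, sum),
  error = function(e) conditionMessage(e)
)
msg

# lapply() visits each data frame in turn; apply() then works inside it.
col_sums <- lapply(groups, function(d) apply(d, 2, sum))
col_sums
```

So no length needs to be specified; the per-group work just has to be wrapped in something that iterates over the list.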

Regards
Allaisone

________________________________
From: Boris Steipe <[hidden email]>
Sent: 17 November 2017 21:12:06
To: Allaisone 1
Cc: R-help
Subject: Re: [R] Complicated analysis for huge databases


Re: Complicated analysis for huge databases

Boris Steipe
Something like the following?

AllMAFs <- list()

for (i in length(SeparatedGroupsofmealsCombs) {
  AllMAFs[[i]] <- apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x)maf(tabulate(x+1)))
}


(untested, of course)
Also the solution is a bit generic since I don't know what the output of maf() looks like in your case, and I don't understand why you use tabulate because I would have assumed that's what maf() does - but that's not for me to worry about :-)



B.
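
As posted, the `for` line lacks the parenthesis that closes `for (`, so R cannot parse it, and `length(...)` on its own would yield a single index rather than a sequence. A possible corrected sketch follows, under stated assumptions: `maf_stub()` is a hypothetical stand-in for the poster's `maf()` (its definition is not shown in the thread), taken here to receive the counts of 0, 1 and 2 and return one number; the toy list and the column names `I` and `II` are invented. Only the value columns are passed on, because `apply()` on a data frame that includes a character column coerces everything to character. `nbins = 3` keeps three counts even when a group contains no 2.

```r
# Hypothetical stand-in for the poster's maf(); the real function is not shown.
# Assumption: it takes the counts of 0, 1 and 2 and returns a single number.
maf_stub <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}

# Toy list standing in for SeparatedGroupsofmealsCombs (values invented).
SeparatedGroupsofmealsCombs <- list(
  "33_55" = data.frame(MealsCombinations = "33_55", Cust.ID = c(1, 3, 5),
                       I = c(0, 1, 2), II = c(1, 0, 1)),
  "44_66" = data.frame(MealsCombinations = "44_66", Cust.ID = c(7, 4, 9),
                       I = c(0, 1, 2), II = c(2, 1, 0))
)

value_cols <- c("I", "II")  # skip the ID columns
AllMAFs <- list()

# seq_along() gives 1, 2, ..., n; note the closing parenthesis before "{".
for (i in seq_along(SeparatedGroupsofmealsCombs)) {
  grp <- SeparatedGroupsofmealsCombs[[i]]
  AllMAFs[[i]] <- sapply(
    grp[value_cols],
    function(x) maf_stub(tabulate(x + 1, nbins = 3))
  )
}
names(AllMAFs) <- names(SeparatedGroupsofmealsCombs)
AllMAFs
```

Each element of `AllMAFs` is then a named numeric vector with one value per column, labelled by meal combination.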



> On Nov 17, 2017, at 7:15 PM, Allaisone 1 <[hidden email]> wrote:

Re: Complicated analysis for huge databases

Bert Gunter-2
In reply to this post by Boris Steipe
Or do it at one go using ?tapply and friends

Bert
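
One way the `tapply` route might look, as a sketch only. `maf_stub()` is a hypothetical stand-in for the poster's `maf()`, assumed to take the counts of 0, 1 and 2 and return one number; the data, the column names `MealA` and `MealB`, and the two-row `ColOfinterest` are invented. `tapply()` handles one column at a time, while `aggregate()` applies the same function to several columns per group.

```r
# Hypothetical stand-in for the poster's maf(); the real function is not shown.
maf_stub <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}
count_then_maf <- function(x) maf_stub(tabulate(x + 1, nbins = 3))

# Toy data (values invented for illustration).
MyData <- data.frame(
  MealA   = c(33, 33, 33, 44, 44, 44),
  MealB   = c(55, 55, 55, 66, 66, 66),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I       = c(0, 1, 2, 0, 1, 2),
  II      = c(1, 0, 1, 2, 1, 0),
  III     = c(2, 2, 1, 2, 0, 1)
)
ColOfinterest <- data.frame(Col = c("I", "II"), stringsAsFactors = FALSE)

group_id   <- paste(MyData$MealA, MyData$MealB, sep = ".")
value_cols <- intersect(ColOfinterest$Col, names(MyData))

# One column, all groups at once.
maf_I <- tapply(MyData$I, group_id, count_then_maf)

# All columns of interest, all groups at once: one row per combination.
maf_by_group <- aggregate(
  MyData[value_cols],
  by  = list(MealsCombinations = group_id),
  FUN = count_then_maf
)
maf_by_group
```

Restricting the columns through `value_cols` keeps the work to the columns named in `ColOfinterest`, which matters with hundreds of columns and hundreds of thousands of rows.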

On Nov 17, 2017 1:12 PM, "Boris Steipe" <[hidden email]> wrote:


Re: Complicated analysis for huge databases

Allaisone 1
In reply to this post by Boris Steipe
Although the loop seems to be formulated correctly, I wonder why
it gives me these errors:

- object 'i' not found
- unexpected '}' in "}"


The desired output is expected to be very large, since for each data frame in the list I expect to see a maf value for each of the 600 columns, and that is only for one data frame in the list. I have around 150-200 data frames, so I'm not sure how R will store these results, but first I need the analysis to be done correctly. The final output has to be something like this:


> mafsforeachcolumns(I,II,...600)foreachcombination

     MealsCombinations   Cust.ID   I       II     III    IV      ...... 600
1    33-55               1         0.124   0.10   0.65   0.467
                         3
                         5

2    44-66               7         0.134   0.43   0.64   0.479
                         4
                         9

.

.

~180 dataframes
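
If each list element is a named vector of per-column values, the list can be stacked into a single table with one row per meal combination, which is close to the layout sketched in this post. A minimal sketch, assuming a list named `AllMAFs`; the numbers are placeholders copied from the first two columns of the mock-up, not computed results.

```r
# Placeholder values copied from the mock-up table (not real results).
AllMAFs <- list(
  "33_55" = c(I = 0.124, II = 0.10),
  "44_66" = c(I = 0.134, II = 0.43)
)

# Stack the vectors: one row per combination, one column per variable.
maf_table <- do.call(rbind, AllMAFs)

# Optional: move the row names into an ordinary column.
maf_df <- data.frame(
  MealsCombinations = rownames(maf_table),
  maf_table,
  row.names = NULL,
  stringsAsFactors = FALSE
)
maf_df
```

A numeric matrix of roughly 180 rows by 600 columns is small for R to hold in memory, so storing the results this way should not be a concern.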


________________________________
From: Boris Steipe <[hidden email]>
Sent: 18 November 2017 00:35:16
To: Allaisone 1; R-help
Subject: Re: [R] Complicated analysis for huge databases


Re: Complicated analysis for huge databases

David Winsemius

> On Nov 18, 2017, at 1:52 AM, Allaisone 1 <[hidden email]> wrote:
>
> Although the loop seems to be formulated correctly I wonder why
> it gives me these errors :
>
> -object 'i' not found
> - unexpected '}' in "}"

You probably did not copy the entire code offered. But we cannot know, since you did not "show your code", nor did you post complete error messages. Both of these practices are strongly recommended by the Posting Guide. Please read it (again?).

--
David.


David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law

______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Complicated analysis for huge databases

Allaisone 1

The loop :


AllMAFs <- list()

 for (i in length(SeparatedGroupsofmealsCombs) {
  AllMAFs[[i]] <- apply( SeparatedGroupsofmealsCombs[[i]], 2, function(x)maf( tabulate( x+1) ))
}


gives these errors (I tried this many times, and I am sure I copied it in its entirety):

Error in apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x) maf(tabulate(x +  :
  object 'i' not found
>  }
Error: unexpected '}' in " }"


The lapply() call:
  results <- lapply(SeparatedGroupsofmealsCombs, function(x) maf(tabulate(x + 1)))
gives this error:
Error in FUN(left, right) : non-numeric argument to binary operator

I have been trying since yesterday, but so far I have not been able to identify
the correct syntax.
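
A possible reading of the lapply() error: each element of SeparatedGroupsofmealsCombs is a whole data frame, so x + 1 is attempted on the data frame itself, including its character column, which would explain "non-numeric argument to binary operator". The sketch below iterates at two levels instead: over the data frames, then over the 0/1/2 columns inside each one. The toy data are made up, and maf_stub() is a hypothetical stand-in, since the real maf() is not shown in the thread and may behave differently.

```r
# Hypothetical stand-in for maf(): takes the counts of the values 0, 1, 2
# and returns the frequency of the rarer allele. The real maf() may differ.
maf_stub <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}

# Toy list mimicking the output of split(); names and values are made up.
groups <- list(
  "33_55" = data.frame(MealsCombinations = "33_55", ID = c(1, 3, 5),
                       I = c(0L, 1L, 2L), II = c(1L, 0L, 1L),
                       stringsAsFactors = FALSE),
  "44_66" = data.frame(MealsCombinations = "44_66", ID = c(7, 4, 9),
                       I = c(0L, 1L, 2L), II = c(2L, 1L, 0L),
                       stringsAsFactors = FALSE)
)

value_cols <- c("I", "II")  # columns holding the 0/1/2 codes

# Outer lapply(): one data frame per meal combination.
# Inner vapply(): one numeric column at a time, so x + 1 is valid.
results <- lapply(groups, function(df) {
  vapply(df[value_cols],
         function(x) maf_stub(tabulate(x + 1, nbins = 3)),
         numeric(1))
})
```

Passing nbins = 3 to tabulate() keeps the count vector at length 3 even when a group contains no 2s (as in column II of the first toy group), so the stand-in always receives three counts.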





Re: Complicated analysis for huge databases

Boris Steipe

The correct code is:

   for (i in 1:length(SeparatedGroupsofmealsCombs)) { ...


I had mentioned that this is untested, but the error is so obvious ...




B.
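
To illustrate why the loop bounds matter, here is a small sketch with a made-up list in place of SeparatedGroupsofmealsCombs. It assumes nothing about maf() and only tabulates the values. seq_along() is used as an alternative to 1:length(); both visit every element of a non-empty list, but seq_along() also behaves sensibly when the list is empty.

```r
# Toy list standing in for SeparatedGroupsofmealsCombs (values are made up).
groups <- list(a = c(0L, 1L, 2L), b = c(2L, 2L, 1L), c = c(0L, 0L, 1L))

# for (i in length(groups)) visits only the single value 3.
visited_length <- integer(0)
for (i in length(groups)) {
  visited_length <- c(visited_length, i)
}

# for (i in seq_along(groups)) visits 1, 2, 3; unlike 1:length(groups),
# it yields zero iterations for an empty list instead of c(1, 0).
visited_seq <- integer(0)
for (i in seq_along(groups)) {
  visited_seq <- c(visited_seq, i)
}

# Pre-allocate the result list, then fill it element by element.
counts <- vector("list", length(groups))
names(counts) <- names(groups)
for (i in seq_along(groups)) {
  counts[[i]] <- tabulate(groups[[i]] + 1, nbins = 3)
}
```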




Re: Complicated analysis for huge databases

Duncan Murdoch-2
In reply to this post by Allaisone 1
On 18/11/2017 4:40 PM, Allaisone 1 wrote:

>
> The loop :
>
>
> AllMAFs <- list()
>
>   for (i in length(SeparatedGroupsofmealsCombs) {
>    AllMAFs[[i]] <- apply( SeparatedGroupsofmealsCombs[[i]], 2, function(x)maf( tabulate( x+1) ))
> }
>
>
> gives these errors (I tried this many times and I'm sure I copied it entirely) :-
>
> Error in apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x) maf(tabulate(x +  :
>    object 'i' not found
>>   }
> Error: unexpected '}' in " }"

The first line of the for loop is short by one ")".  You should have
seen this error after the first line:

Error: unexpected '{' in " for (i in length(SeparatedGroupsofmealsCombs) {"

Once that line is thrown away, the message about i makes sense:

AllMAFs[[i]] <- apply( SeparatedGroupsofmealsCombs[[i]], 2,
function(x)maf( tabulate( x+1) ))

refers to a variable "i" that has never been defined.

Duncan Murdoch
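
The diagnosis above can be checked without running the loop at all. The following sketch uses parse(), which only checks syntax and evaluates nothing, so the list does not need to exist. The corrected header with seq_along() is one possible fix, not necessarily the one intended in the thread.

```r
# The loop header from the thread, one ")" short.
bad_header <- "for (i in length(SeparatedGroupsofmealsCombs) {"

# A header with balanced parentheses that also visits every index.
good_header <- "for (i in seq_along(SeparatedGroupsofmealsCombs)) { NULL }"

# Return "parsed" on success, or the parser's message on failure.
check_syntax <- function(code) {
  tryCatch({
    parse(text = code)
    "parsed"
  }, error = function(e) conditionMessage(e))
}

bad_msg  <- check_syntax(bad_header)   # a syntax error message
good_msg <- check_syntax(good_header)  # "parsed"
```

Because the first line never parses, R discards it and then reads the following lines as separate top-level statements, which is where the later 'i' not found and unexpected '}' messages come from.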


Re: Complicated analysis for huge databases

Allaisone 1
Thanks, but a new error appeared with the loop:


Error in x + 1 : non-numeric argument to binary operator


I think this can be solved by converting the columns (I, II, III, ..., 600) to "numeric" instead of the current "int" type, as shown below in the structure of the "33_55" data frame.


$ 33_55:'data.frame': 256 obs. of  600 variables:
  ..$ MealsCombinations                 : chr [1:256] "33_55" "33_55" ....
  ..$ ID                               : num [1:256] 1  3  5  ...
  ..$ I                   : int [1:256] 1 1 2 1 1 2 1 2 1 1 ...
  ..$ II                    : int [1:256] 2 1 2 2 1 2 2 2 2 2 ...
  ..$ III                  : int [1:256] 1 1 2 1 1 2 1 2 1 1 ...
  ..$ IV                  : int [1:256] 2 2 2 2 2 2 2 2 2 2 ...


Are there any suggestions for solving this without unlisting the list? Unlisting would destroy the current shape of the data (i.e. separate data frames). I need to find the maf for each column (I, II, III, ..., 600) in this data frame (33_55) and in the rest of the data frames.
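
One hedged reading of this error: int columns support arithmetic just as numeric ones do, so the storage type is probably not the cause. apply() converts a data frame argument to a matrix first, and a single chr column such as MealsCombinations turns the whole matrix into character, at which point x + 1 fails. The sketch below uses made-up values and assumes the column layout shown in the str() output above; it drops the non-numeric columns before calling apply() and leaves the list of data frames intact.

```r
# Toy data frame with the same column types as the "33_55" element.
df <- data.frame(MealsCombinations = c("33_55", "33_55", "33_55"),
                 ID = c(1, 3, 5),
                 I  = c(1L, 1L, 2L),
                 II = c(2L, 1L, 2L),
                 stringsAsFactors = FALSE)

# Integer columns work with arithmetic directly; no conversion is needed.
ok <- df$I + 1

# One character column forces the whole matrix to character.
mode_all <- mode(as.matrix(df))          # "character"

# Keep only the 0/1/2 columns before calling apply().
keep <- setdiff(names(df), c("MealsCombinations", "ID"))
mode_keep <- mode(as.matrix(df[keep]))   # "numeric"

# Column-wise counts of the values 0, 1, 2 (one column of counts per variable).
counts <- apply(df[keep], 2, function(x) tabulate(x + 1, nbins = 3))
```

The same column selection could be done inside the loop for each data frame in the list, for example with SeparatedGroupsofmealsCombs[[i]][keep], so no unlisting is required.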

________________________________
From: Duncan Murdoch <[hidden email]>
Sent: 18 November 2017 23:15:15
To: Allaisone 1; David Winsemius
Cc: R-help
Subject: Re: [R] Complicated analysis for huge databases

On 18/11/2017 4:40 PM, Allaisone 1 wrote:

>
> The loop :
>
>
> AllMAFs <- list()
>
>   for (i in length(SeparatedGroupsofmealsCombs) {
>    AllMAFs[[i]] <- apply( SeparatedGroupsofmealsCombs[[i]], 2, function(x)maf( tabulate( x+1) ))
> }
>
>
> gives these errors (I tried this many times and I'm sure I copied it entirely) :-
>
> Error in apply(SeparatedGroupsofmealsCombs[[i]], 2, function(x) maf(tabulate(x +  :
>    object 'i' not found
>>   }
> Error: unexpected '}' in " }"

The first line of the for loop is short by one ")".  You should have
seen this error after the first line:

Error: unexpected '{' in " for (i in length(SeparatedGroupsofmealsCombs) {"

Once that line is thrown away, the message about i makes sense:

AllMAFs[[i]] <- apply( SeparatedGroupsofmealsCombs[[i]], 2,
function(x)maf( tabulate( x+1) ))

refers to a variable "i" that has never been defined.

Duncan Murdoch

>
>
> The lapply function :
>    results<-lapply(SeparatedGroupsofmealsCombs , function(x)maf(tabulate(x+1)))
> gives this error :-
> Error in FUN(left, right) : non-numeric argument to binary operator
>
> I have been trying since yesterday but but until now I'm not able to identify
> the correct syntax
>
>
>
>
> ________________________________
> From: David Winsemius <[hidden email]>
> Sent: 18 November 2017 20:06:56
> To: Allaisone 1
> Cc: Boris Steipe; R-help
> Subject: Re: [R] Complicated analysis for huge databases
>
>
>> On Nov 18, 2017, at 1:52 AM, Allaisone 1 <[hidden email]> wrote:
>>
>> Although the loop seems to be formulated correctly I wonder why
>> it gives me these errors :
>>
>> -object 'i' not found
>> - unexpected '}' in "}"
>
> You probably did not copy the entire code offered. But we cannot know since you did not "show your code", not=r did you post complete error messages. Both of these practices are strongly recommended by the Posting Guide. Please read it (again?).
>
> --
> David.
>>
>>
>> the desired output is expected to be very large as for each dataframe in the list of dataframes I expect to see maf value for each of the 600 columns! and this is only for
>>
>> for one dataframe in the list .. I have around 150-200 dataframes.. not sure how R will store these results.. but first I need the analysis to be done correctly. The final output has to be something like this :-
>>
>>
>>> mafsforeachcolumns(I,II,...600)foreachcombination
>>
>>    MealsCombinations  Cust.ID  I      II    III   IV     ...... 600
>> 1  33-55              1        0.124  0.10  0.65  0.467
>>                       3
>>                       5
>> 2  44-66              7        0.134  0.43  0.64  0.479
>>                       4
>>                       9
>> .
>> .
>> ~180 dataframes
>>
>>
>> ________________________________
>> From: Boris Steipe <[hidden email]>
>> Sent: 18 November 2017 00:35:16
>> To: Allaisone 1; R-help
>> Subject: Re: [R] Complicated analysis for huge databases
>>
>> Something like the following?
>>
>> AllMAFs <- list()
>>
>> for (i in length(SeparatedGroupsofmealsCombs) {
>>   AllMAFs[[i]] <- apply( SeparatedGroupsofmealsCombs[[i]], 2, function(x)maf( tabulate( x+1) ))
>> }
>>
>>
>> (untested, of course)
>> Also the solution is a bit generic since I don't know what the output of maf() looks like in your case, and I don't understand why you use tabulate because I would have assumed that's what maf() does - but that's not for me to worry about :-)
>>
>>
>>
>> B.
>>
>>
>>
>>> On Nov 17, 2017, at 7:15 PM, Allaisone 1 <[hidden email]> wrote:
>>>
>>>
>>> Thanks Boris , this was very helpful but I'm struggling with the last part.
>>>
>>> 1) I combined the first 2 columns :-
>>>
>>>
>>> library(tidyr)
>>> SingleMealsCode <-unite(MyData, MealsCombinations, c(MealA, MealB), remove=FALSE)
>>> SingleMealsCode <- SingleMealsCode[,-2]
>>>
>>>   2) I separated this data frame into different data frames based on the "MealsCombinations"
>>>    column so R will recognize each meal combination separately:
>>>
>>> SeparatedGroupsofmealsCombs <- split(SingleMealsCode, SingleMealsCode$MealsCombinations)
>>>
>>> After investigating the structure of "SeparatedGroupsofmealsCombs", I can see
>>> a list of data frames, each of which represents a different meal combination, which is great.
>>>
>>> Now I'm struggling with the last part: how can I run the maf code for all data frames?
>>>
>>> when I run this code as before :-
>>>
>>> maf <- apply(SeparatedGroupsofmealsCombs, 2, function(x)maf(tabulate(x+1)))
>>>
>>> an error message says : dim(X) must have a positive length . I'm not sure which length
>>> I need to specify.. any suggestions to correct this syntax ?
>>>
>>> Regards
>>> Allaisone
>>> From: Boris Steipe <[hidden email]>
>>> Sent: 17 November 2017 21:12:06
>>> To: Allaisone 1
>>> Cc: R-help
>>> Subject: Re: [R] Complicated analysis for huge databases
>>>
>>> Combine columns 1 and 2 into a column with a single ID like "33.55", "44.66" and use split() on these IDs to break up your dataset. Iterate over the list of data frames split() returns.
>>>
>>>
>>> B.
>>>
>>>> On Nov 17, 2017, at 12:59 PM, Allaisone 1 <[hidden email]> wrote:
>>>>
>>>>
>>>> Hi all ..,
>>>>
>>>>
>>>> I have a large dataset of around 600,000 rows and 600 columns. The first column holds codes for Meal A and the second column holds codes for Meal B. The third column holds customer IDs, where each customer had a combination of meals. Each of the remaining columns contains the values 0, 1, or 2. The dataset is organised so that the first group of customers had similar meal combinations, followed by another group of customers with similar meal combinations that differ from the first group, and so on. The dataset looks like this:
>>>>
>>>>
>>>>> MyData
>>>>
>>>>    Meal A  Meal B  Cust.ID  I  II  III  IV  ...... 600
>>>> 1  33      55      1        0  1   2    0
>>>> 2  33      55      3        1  0   2    2
>>>> 3  33      55      5        2  1   1    2
>>>> 4  44      66      7        0  2   2    2
>>>> 5  44      66      4        1  1   0    1
>>>> 6  44      66      9        2  0   1    2
>>>> .
>>>> .
>>>> 600,000
>>>>
>>>>
>>>>
>>>> I want to find maf() for each column (from 4 to 600) after calculating the frequency of the three values (0, 1, 2), but this should be done group by group (i.e. group 33-55: rows 1:3, then group 44-66: rows 4:6, and so on).
>>>>
>>>>
>>>> I can do the analysis for each entire column, as below, but not group by group:
>>>>
>>>>
>>>> MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1)))
>>>>
>>>> How can I modify this code to tell R to do the analysis group by group for each column, so that I get the maf value for the 33-55 group of column I, then the maf value for the 44-66 group in the same column I, then the rest of the groups in this column, and likewise for the remaining columns?
>>>>
>>>> In fact, I'm interested in doing this analysis for only 300 columns, not all of the 600 columns.
>>>> I have another sheet that contains the names of the columns of interest, like this:
>>>>
>>>>> ColOfinterest
>>>>
>>>> Col
>>>> I
>>>> IV
>>>> V
>>>> .
>>>> .
>>>> 300
>>>>
>>>> Could anyone help with the best combination of syntax to perform this complex analysis?
>>>>
>>>> Regards
>>>> Allaisone
>
> David Winsemius
> Alameda, CA, USA
>
> 'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law



______________________________________________
[hidden email] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: Complicated analysis for huge databases

Boris Steipe
Here are some elementary facts for you to ponder:

R > is.numeric(1)
[1] TRUE
R > is.integer(1)
[1] FALSE
R > is.numeric(1L)
[1] TRUE
R > is.integer(1L)
[1] TRUE
R > 1 + 1
[1] 2
R > 1 + 1L
[1] 2
R > 1L + 1L
[1] 2
R > 1L + "1"
Error in 1L + "1" : non-numeric argument to binary operator
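
One way to connect these facts to the reported error is sketched below; it has not been run on the real data. The premise is that apply() first converts a data frame with as.matrix(), so a single character column such as MealsCombinations makes every cell character, and x + 1 then fails regardless of whether the other columns are "int" or "num". The toy data frame and its values are made up for illustration.

```r
# Made-up group shaped like the "33_55" data frame in the quoted str() output.
grp <- data.frame(
  MealsCombinations = c("33_55", "33_55", "33_55"),
  ID = c(1, 3, 5),
  I  = c(1L, 1L, 2L),
  II = c(2L, 1L, 2L),
  stringsAsFactors = FALSE
)

# apply() works on a matrix; one character column turns the whole
# matrix into character, including the integer columns.
m <- as.matrix(grp)
is.character(m)

# So x + 1 inside apply() receives character vectors and fails.
# tryCatch() captures the error message instead of stopping the script.
res <- tryCatch(apply(grp, 2, function(x) x + 1),
                error = function(e) conditionMessage(e))

# Restricting apply() to the integer columns keeps the matrix numeric.
ok <- apply(grp[, c("I", "II")], 2, function(x) tabulate(x + 1, nbins = 3))
```

Under these assumptions the fix is to drop the non-numeric columns before calling apply(), not to convert "int" to "numeric".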


Now, here is a word of caution: Your analysis of the error, and your failure to spot the egregious error that I had made in a previous post (embarrassing, I know), tell me that you are way out of your league with even simple data handling in R. You may get this to run somehow, but unless you step back and invest some serious time in a systematic introduction to R, you are going to let yourself down, and others who rely on your work.

That's going to be all from me.


B.


> On Nov 18, 2017, at 7:15 PM, Allaisone 1 <[hidden email]> wrote:
>
> Thanks but a new error appeared with the loop :
>
>
> Error in x + 1 : non-numeric argument to binary operator
>
>
> I think this can be solved by converting the columns (I, II, III, ..., 600) to "numeric" instead of
> the current "int" type, as shown below in the structure of the "33_55" data frame.
>
>
> $ 33_55:'data.frame': 256 obs. of  600 variables:
>  ..$ MealsCombinations: chr [1:256] "33_55" "33_55" ....
>  ..$ ID               : num [1:256] 1  3  5  ...
>  ..$ I                : int [1:256] 1 1 2 1 1 2 1 2 1 1 ...
>  ..$ II               : int [1:256] 2 1 2 2 1 2 2 2 2 2 ...
>  ..$ III              : int [1:256] 1 1 2 1 1 2 1 2 1 1 ...
>  ..$ IV               : int [1:256] 2 2 2 2 2 2 2 2 2 2 ...
>
>
> Any suggestions to solve this issue without unlisting the list? Unlisting would change
> the current shape of the data (i.e. separate data frames). I need to find maf for each column (I, II, III, ..., 600) in this data frame (33_55) and in the rest of the data frames.
>
