Hi all,
I am trying to read and summarize a big data frame( >10M records) Here is the sample of my data state,city,x 1,12,100 1,12,100 1,12,200 1,13,200 1,13,100 1,13,100 1,14,200 2,21,200 2,21,200 2,21,100 2,23,100 2,23,200 2,34,200 2,34,100 2,35,100 I want get the total count by state, and the the number of cities by state. The x variable is either 100 or 200 and count each The result should look like as follows. state,city,count,100's,200's 1,3,7,4,3 2,4,8,4,4 At the present I am doing it in several steps and taking too long Is there an efficient way of doing this? ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
This should be reasonably efficient with 'dplyr':
> library(dplyr) > input <- read.csv(text = "state,city,x + 1,12,100 + 1,12,100 + 1,12,200 + 1,13,200 + 1,13,100 + 1,13,100 + 1,14,200 + 2,21,200 + 2,21,200 + 2,21,100 + 2,23,100 + 2,23,200 + 2,34,200 + 2,34,100 + 2,35,100") > > result <- input %>% + group_by(state) %>% + summarise(nCities = length(unique(city)), + count = n(), + `100's` = sum(x == 100), + `200's` = sum(x == 200) + ) > result # A tibble: 2 × 5 state nCities count `100's` `200's` <int> <int> <int> <int> <int> 1 1 3 7 4 3 2 2 4 8 4 4 Or you can also use data.table: > library(data.table) > input <- fread("state,city,x + 1,12,100 + 1,12,100 + 1,12,200 + 1,13,200 + 1,13,100 + 1,13,100 + 1,14,200 + 2,21,200 + 2,21,200 + 2,21,100 + 2,23,100 + 2,23,200 + 2,34,200 + 2,34,100 + 2,35,100") > > input[, .(nCities = length(unique(city)), + count = .N, + `100's` = sum(x == 100), + `200's` = sum(x == 200) + ) + , keyby = state + ] state nCities count 100's 200's 1: 1 3 7 4 3 2: 2 4 8 4 4 Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it. On Sat, Dec 3, 2016 at 10:40 AM, Val <[hidden email]> wrote: > Hi all, > > I am trying to read and summarize a big data frame( >10M records) > > Here is the sample of my data > state,city,x > 1,12,100 > 1,12,100 > 1,12,200 > 1,13,200 > 1,13,100 > 1,13,100 > 1,14,200 > 2,21,200 > 2,21,200 > 2,21,100 > 2,23,100 > 2,23,200 > 2,34,200 > 2,34,100 > 2,35,100 > > I want get the total count by state, and the the number of cities > by state. The x variable is either 100 or 200 and count each > > The result should look like as follows. > > state,city,count,100's,200's > 1,3,7,4,3 > 2,4,8,4,4 > > At the present I am doing it in several steps and taking too long > > Is there an efficient way of doing this? > > ______________________________________________ > [hidden email] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/ > posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ [hidden email] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. |
Free forum by Nabble | Edit this page |