#StackBounty: #r #dataframe #group-by Populating a new data frame with summaries from a large dataset of counts and years

Bounty: 50

I’m trying to teach myself about manipulating and summarizing large datasets. I’d like to create a new data frame and populate it with summary data about the years associated with count observations by location. I’ve created an example data frame below.

example.frame <- data.frame(
                   "Obs.ID" = 1:50, 
                   "Species" = rep("T. rex", 50),
                   "Site" = c(rep("Big Red", 24), rep("Supermax", 26)),
                   "Site.ID" = c(rep("1578", 24), rep("0185", 26)),
                   "Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
                              2003, 2003, 2003, 2003, 2003, 2003, 2004,
                              2004, 2004, 2004, 2004, 2005, 2005, 2005, 
                              2006, 2006, 2007, 1978, 1978, 1978, 1978,
                              1979, 1979, 1999, 1999, 2000, 2000, 2000,
                              2000, 2000, 2001, 2001, 2001, 2002, 2003,
                              2003, 2003, 2003, 2004, 2005, 2006, 2006,
                              2006),
                   "Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0,
                               0, 1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7,
                               2, 3, 1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0,
                               0, 2, 3, 1, 2, 0, 1, 2, 1),
                   stringsAsFactors = FALSE)

I’d like to create a new data frame summarizing these example data and include the following columns, defined below.

Line.ID – A unique sequential identifier for the line in the data frame starting with 1
Site – The site name from example.data
Total.years – The total number of unique years associated with the site
Years.3 – The number of years with at least three associated counts for the site, including zeros
Years.4 – The number of years with at least four associated counts for the site, including zeros
Years.5 – The number of years with at least five or more associated counts for the site, including zeros
Total.pos – The total number of years with at least one count for the site above zero
Pos.3 – The number of years with at least three counts for the site above zero
Pos.4 – The number of years with at least four counts for the site above zero
Pos.5 – The number of years with at least five counts for the site above zero

The new data frame should look like this:

new.frame <- data.frame(
               "Line.ID" = c(1, 2),
               "Site" = c("Big Red", "Supermax"),
               "Total.years" = c(9, 10),
               "Years.3" = c(4, 5),
               "Years.4" = c(2, 3),
               "Years.5" = c(2, 1),
               "Total.pos" = c(8, 8),
               "Pos.3" = c(3, 5),
               "Pos.4" = c(1, 2),
               "Pos.5" = c(1, 0),
               stringsAsFactors = FALSE)

I think the right way to go about it would be some combination of ‘summarise’ and ‘group_by’ in dplyr, but I can’t figure out how to put it all together. I wasn’t able to find an answered question that would work for this scenario, so I thought this would be useful to post.

Follow up question:
How would I incorporate an additional layer into creating the summary table (for example, adding an additional species that occurs at the same sites)? Example data frame below.

example.frame.2 <- data.frame(
                     "Obs.ID" = 1:80,
                     "Species" = c(rep("T. rex", 50),
                                   rep("T. bataar", 30)),
                     "Site" = c(rep("Big Red", 24),
                                rep("Supermax", 26),
                                rep("Big Red", 16),
                                rep("Supermax", 10),
                                rep("Oz", 4)),
                     "Site.ID" = c(rep("1578", 24), rep("0185", 26),
                                   rep("1578", 16), rep("0185", 10),
                                   rep("2115", 4)),
                     "Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
                                2003, 2003, 2003, 2003, 2003, 2003, 2004,
                                2004, 2004, 2004, 2004, 2005, 2005, 2005,
                                2006, 2006, 2007, 1978, 1978, 1978, 1978,
                                1979, 1979, 1999, 1999, 2000, 2000, 2000,
                                2000, 2000, 2001, 2001, 2001, 2002, 2003,
                                2003, 2003, 2003, 2004, 2005, 2006, 2006,
                                2006, 2003, 2003, 2003, 2003, 2003, 2004,
                                2004, 2004, 2004, 2004, 2005, 2005, 2005,
                                2006, 2006, 2007, 1978, 1978, 1978, 1978,
                                1979, 1979, 1999, 1999, 2000, 2000, 2012,
                                2012, 2012, 2013),
                     "Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0, 0,
                                 1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3,
                                 1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0, 0, 2, 3,
                                 1, 2, 0, 1, 2, 1, 1, 3, 2, 1, 1, 0, 0, 1, 2,
                                 3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3, 1, 1,
                                 0, 2, 5, 1, 1, 3, 0),
                     stringsAsFactors = FALSE)

Summary data frame with species layer below.

new.frame.2 <- data.frame(
                 "Line.ID" = c(1, 2, 3, 4, 5),
                 "Species" = c(rep("T. rex", 2), rep("T. bataar", 3)),
                 "Site" = c("Big Red", "Supermax", "Big Red", "Supermax", "Oz"),
                 "Total.years" = c(9, 10, 5, 4, 2),
                 "Years.3" = c(4, 5, 3, 1, 1),
                 "Years.4" = c(2, 3, 2, 1, 0),
                 "Years.5" = c(2, 1, 2, 0, 0),
                 "Total.pos" = c(8, 8, 5, 4, 1),
                 "Pos.3" = c(3, 5, 3, 1, 1),
                 "Pos.4" = c(1, 2, 1, 1, 0),
                 "Pos.5" = c(1, 0, 1, 0, 0),
                 stringsAsFactors = FALSE)


Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.