#StackBounty: #apache-spark #group-by #pyspark #k-means Pyspark: applying kmeans on different groups of a dataframe

Bounty: 50

Using Pyspark I would like to apply kmeans separately on groups of a dataframe and not to the whole dataframe at once. For the moment I use a for loop which iterates on each group, applies kmeans and appends the result to another table. But having a lot of groups makes it time consuming. Anyone could help me please??
Thanks a lot!

for customer in customer_list:
    temp_df = togroup.filter(col("customer_id")==customer)
    df = assembler.transform(temp_df)
    k = 1
    while (k < 5 & mtrc < width):
        k += 1
        kmeans = KMeans(k=k,seed=5,maxIter=20,initSteps=5)
        model = kmeans.fit(df)
        mtric = 1 - model.computeCost(df)/ttvar
        a = model.transform(df)select(cols)
        allcustomers = allcustomers .union(a)


Get this bounty!!!

#StackBounty: #r #dataframe #group-by Populating a new data frame with summaries from a large dataset of counts and years

Bounty: 50

I’m trying to teach myself about manipulating and summarizing large datasets. I’d like to create a new data frame and populate it with summary data about the years associated with count observations by location. I’ve created an example data frame below.

example.frame <- data.frame(
                   "Obs.ID" = 1:50, 
                   "Species" = rep("T. rex", 50),
                   "Site" = c(rep("Big Red", 24), rep("Supermax", 26)),
                   "Site.ID" = c(rep("1578", 24), rep("0185", 26)),
                   "Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
                              2003, 2003, 2003, 2003, 2003, 2003, 2004,
                              2004, 2004, 2004, 2004, 2005, 2005, 2005, 
                              2006, 2006, 2007, 1978, 1978, 1978, 1978,
                              1979, 1979, 1999, 1999, 2000, 2000, 2000,
                              2000, 2000, 2001, 2001, 2001, 2002, 2003,
                              2003, 2003, 2003, 2004, 2005, 2006, 2006,
                              2006),
                   "Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0,
                               0, 1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7,
                               2, 3, 1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0,
                               0, 2, 3, 1, 2, 0, 1, 2, 1),
                   stringsAsFactors = FALSE)

I’d like to create a new data frame summarizing these example data and include the following columns, defined below.

Line.ID – A unique sequential identifier for the line in the data frame starting with 1
Site – The site name from example.data
Total.years – The total number of unique years associated with the site
Years.3 – The number of years with at least three associated counts for the site, including zeros
Years.4 – The number of years with at least four associated counts for the site, including zeros
Years.5 – The number of years with at least five or more associated counts for the site, including zeros
Total.pos – The total number of years with at least one count for the site above zero
Pos.3 – The number of years with at least three counts for the site above zero
Pos.4 – The number of years with at least four counts for the site above zero
Pos.5 – The number of years with at least five counts for the site above zero

The new data frame should look like this:

new.frame <- data.frame(
               "Line.ID" = c(1, 2),
               "Site" = c("Big Red", "Supermax"),
               "Total.years" = c(9, 10),
               "Years.3" = c(4, 5),
               "Years.4" = c(2, 3),
               "Years.5" = c(2, 1),
               "Total.pos" = c(8, 8),
               "Pos.3" = c(3, 5),
               "Pos.4" = c(1, 2),
               "Pos.5" = c(1, 0),
               stringsAsFactors = FALSE)

I think the right way to go about it would be some combination of ‘summarise’ and ‘group_by’ in dplyr, but I can’t figure out how to put it all together. I wasn’t able to find an answered question that would work for this scenario, so I thought this would be useful to post.

Follow up question:
How would I incorporate an additional layer into creating the summary table (for example, adding an additional species that occurs at the same sites)? Example data frame below.

example.frame.2 <- data.frame(
                     "Obs.ID" = 1:80,
                     "Species" = c(rep("T. rex", 50),
                                   rep("T. bataar", 30)),
                     "Site" = c(rep("Big Red", 24),
                                rep("Supermax", 26),
                                rep("Big Red", 16),
                                rep("Supermax", 10),
                                rep("Oz", 4)),
                     "Site.ID" = c(rep("1578", 24), rep("0185", 26),
                                   rep("1578", 16), rep("0185", 10),
                                   rep("2115", 4)),
                     "Year" = c(1999, 1999, 1999, 2000, 2001, 2002, 2002,
                                2003, 2003, 2003, 2003, 2003, 2003, 2004,
                                2004, 2004, 2004, 2004, 2005, 2005, 2005,
                                2006, 2006, 2007, 1978, 1978, 1978, 1978,
                                1979, 1979, 1999, 1999, 2000, 2000, 2000,
                                2000, 2000, 2001, 2001, 2001, 2002, 2003,
                                2003, 2003, 2003, 2004, 2005, 2006, 2006,
                                2006, 2003, 2003, 2003, 2003, 2003, 2004,
                                2004, 2004, 2004, 2004, 2005, 2005, 2005,
                                2006, 2006, 2007, 1978, 1978, 1978, 1978,
                                1979, 1979, 1999, 1999, 2000, 2000, 2012,
                                2012, 2012, 2013),
                     "Count" = c(0, 1, 5, 0, 3, 1, 1, 0, 1, 3, 2, 1, 1, 0, 0,
                                 1, 2, 3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3,
                                 1, 1, 0, 2, 5, 6, 0, 1, 2, 1, 1, 0, 0, 2, 3,
                                 1, 2, 0, 1, 2, 1, 1, 3, 2, 1, 1, 0, 0, 1, 2,
                                 3, 1, 1, 5, 0, 1, 2, 8, 11, 7, 2, 3, 1, 1,
                                 0, 2, 5, 1, 1, 3, 0),
                     stringsAsFactors = FALSE)

Summary data frame with species layer below.

new.frame.2 <- data.frame(
                 "Line.ID" = c(1, 2, 3, 4, 5),
                 "Species" = c(rep("T. rex", 2), rep("T. bataar", 3)),
                 "Site" = c("Big Red", "Supermax", "Big Red", "Supermax", "Oz"),
                 "Total.years" = c(9, 10, 5, 4, 2),
                 "Years.3" = c(4, 5, 3, 1, 1),
                 "Years.4" = c(2, 3, 2, 1, 0),
                 "Years.5" = c(2, 1, 2, 0, 0),
                 "Total.pos" = c(8, 8, 5, 4, 1),
                 "Pos.3" = c(3, 5, 3, 1, 1),
                 "Pos.4" = c(1, 2, 1, 1, 0),
                 "Pos.5" = c(1, 0, 1, 0, 0),
                 stringsAsFactors = FALSE)


Get this bounty!!!

#StackBounty: #scala #join #activerecord #group-by Write join query with groupby in Scala ActiveRecord

Bounty: 150

I am trying to write a specific query in scala Active record. But it always returns nothing. I have read the wiki on the github page but it does not contain a lot of info on it. The query I am trying to write is

SELECT e.name, e.id, COUNT(pt.pass_id) as pass_count, e.start_date, e.total_passes_to_offer
FROM events e inner join passes p on e.id = p.event_id inner join pass_tickets pt on p.id = pt.pass_id where e.partner_id = 198 group by e.name, e.id

What I have tried is

Event.joins[Pass, PassTicket](
                (event, pass, passTicket) => (event.id === pass.eventId, pass.id === passTicket.passId)
            ).where(
                (event, _, _) => event.partnerId === partnerId
            ).select(
                (event, pass, _) => (event.name, event.id, PassTicket.where(_.passId === pass.id).count, event.startDate, event.totalPassesToOffer)
            ).groupBy( data => data._2)

But first, the return type becomes a map, not a list. And second when executed, it doesnt return anything even though the data exists and the SQL executes fine.


Get this bounty!!!