#StackBounty: #python #pandas #pandas-groupby Filter out data based on dynamic, array-like column values

Bounty: 50

I have sample data which can be downloaded from here. I want to filter this data based on the element_counts, cluster_choice_prob_k_fold, and benchmark_probabilities columns, separately for every individual.

This is the logic:
The element_counts column holds an array-like value, [3, 2, 0, 0, 0] in this case. It shows how many rows I should select, taking the highest benchmark_probabilities, from each of the top 5 cluster_choice_prob_k_fold values. In this example, the array means:

  1. From the biggest cluster_choice_prob_k_fold group, select 3 values based on their benchmark_probabilities
  2. From the second biggest cluster_choice_prob_k_fold group, select 2 values based on their benchmark_probabilities
  3. From the third biggest cluster_choice_prob_k_fold group, select 0 values based on their benchmark_probabilities
  4. From the fourth biggest cluster_choice_prob_k_fold group, select 0 values (same as 3)
  5. From the fifth biggest cluster_choice_prob_k_fold group, select 0 values (same as 4)

So, my end result should look like this:

individual  cluster_choice_prob_k_fold  benchmark_probabilities  element_counts
9710535     0.512776                    0.163837                 [3, 2, 0, 0, 0]
9710535     0.512776                    0.0986                   [3, 2, 0, 0, 0]
9710535     0.512776                    0.085191                 [3, 2, 0, 0, 0]
9710535     0.294674                    0.050787                 [3, 2, 0, 0, 0]
9710535     0.294674                    0.037609                 [3, 2, 0, 0, 0]

For other individual values, the element_counts array looks different, but it always has 5 elements, and the element values always add up to 5. How can I do this with Pandas? The difficulty is writing a function that can take the different arrays into account and be used with groupby.
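
One possible way to express the logic described above is sketched below. It is not a confirmed answer, only a minimal sketch under a few assumptions: element_counts holds real Python lists (not strings), every row of an individual carries the same list, and rows in the same cluster share exactly the same cluster_choice_prob_k_fold value. The helper name select_top and the toy rows marked "made up" are hypothetical and exist only so the example runs without the downloadable file. The groups are iterated explicitly instead of using groupby(...).apply, since the way apply treats the grouping column differs between pandas versions.

```python
import pandas as pd

def select_top(group: pd.DataFrame) -> pd.DataFrame:
    """Keep the rows of one individual according to its element_counts list."""
    # Assumption: element_counts is identical on every row of the individual.
    counts = group["element_counts"].iloc[0]
    # Distinct cluster probabilities, largest first, at most len(counts) of them.
    top_clusters = (
        group["cluster_choice_prob_k_fold"].drop_duplicates().nlargest(len(counts))
    )
    parts = []
    for prob, n in zip(top_clusters, counts):
        if n > 0:
            rows = group[group["cluster_choice_prob_k_fold"] == prob]
            # Top n rows of this cluster by benchmark_probabilities.
            parts.append(rows.nlargest(n, "benchmark_probabilities"))
    # Return an empty frame with the same columns if nothing was selected.
    return pd.concat(parts) if parts else group.iloc[0:0]

# Toy data: the first individual mirrors the example in the question,
# the rows marked "made up" are invented so that there is something to drop.
rows = [
    (9710535, 0.512776, 0.163837),
    (9710535, 0.512776, 0.085191),
    (9710535, 0.512776, 0.0986),
    (9710535, 0.512776, 0.01),      # made up, should be dropped
    (9710535, 0.294674, 0.037609),
    (9710535, 0.294674, 0.050787),
    (9710535, 0.294674, 0.02),      # made up, should be dropped
    (9710535, 0.1, 0.3),            # made up third cluster, its count is 0
    (42, 0.6, 0.5), (42, 0.6, 0.4), (42, 0.6, 0.3),   # individual 42 is made up
    (42, 0.6, 0.2), (42, 0.6, 0.1),
    (42, 0.3, 0.9), (42, 0.3, 0.8),
]
counts_by_individual = {9710535: [3, 2, 0, 0, 0], 42: [4, 1, 0, 0, 0]}

df = pd.DataFrame(
    rows,
    columns=["individual", "cluster_choice_prob_k_fold", "benchmark_probabilities"],
)
df["element_counts"] = [counts_by_individual[i] for i in df["individual"]]

# Apply the helper per individual and stitch the pieces back together.
result = pd.concat(
    [select_top(g) for _, g in df.groupby("individual", sort=False)],
    ignore_index=True,
)
print(result)
```

If element_counts is read from a CSV file it will probably arrive as strings such as "[3, 2, 0, 0, 0]"; in that case something like ast.literal_eval would be needed to turn each value into a list before calling the helper.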

