#StackBounty: #python #numpy #optimization #cvxpy Inconsistency in solutions using CVXPY

Bounty: 50

Please, consider the following optimisation problem. Specifically, x and b are (1, n) vectors, C is an (n, n) symmetric matrix, k is an arbitrary constant and i is a (1, n) vector of ones.

[image: formulation of the first optimisation problem]

Please, also consider the following equivalent optimisation problem. In this case, k is determined during the optimisation process, so there is no need to scale the values in x to obtain the solution y.

[image: formulation of the second, equivalent optimisation problem]

Please, also consider the following code for solving both problems with cvxpy.

import cvxpy as cp
import numpy as np

def problem_1(C):
    n, t = np.shape(C)

    x = cp.Variable(n)
    b = np.array([1 / n] * n)

    obj = cp.quad_form(x, C)
    constraints = [b.T @ cp.log(x) >= 0.5, x >= 0]
    cp.Problem(cp.Minimize(obj), constraints).solve()

    return (x.value / (np.ones(n).T @ x.value))

def problem_2(C):
    n, t = np.shape(C)

    y = cp.Variable(n)
    k = cp.Variable()
    b = np.array([1 / n] * n)

    obj = cp.quad_form(y, C)
    constraints = [b.T @ cp.log(y) >= k, np.ones(n) @ y.T == 1, y >= 0]
    cp.Problem(cp.Minimize(obj), constraints).solve()

    return y.value

While the first function does provide me with the correct solution for the sample set of data I am using, the second does not. Specifically, the values in y differ heavily when using the second function, with some of them being equal to zero (which cannot be, since all values in b are strictly positive). I am wondering whether or not the second function also minimises k. Its value should not be minimised; on the contrary, it should simply be determined during the optimisation as the value that leads to the solution minimising the objective function.
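For what it is worth, here is a quick diagnostic variant of problem_2 I put together (a sketch only, with the same construction as above) that also reports the solver status, the value chosen for k and the value of the logarithmic term at the solution, since that is exactly what I am unsure about:

def problem_2_debug(C):
    n, t = np.shape(C)

    y = cp.Variable(n)
    k = cp.Variable()
    b = np.array([1 / n] * n)

    prob = cp.Problem(cp.Minimize(cp.quad_form(y, C)),
                      [b.T @ cp.log(y) >= k, np.ones(n) @ y.T == 1, y >= 0])
    prob.solve()

    # solver status, the chosen k, and the logarithmic term evaluated at the solution
    print(prob.status, k.value, b.T @ np.log(y.value))
    return y.value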

UPDATE_1

I just found that the solution that I obtain with the second formulation of the problem is equal to the one derived with the following equations and function. It appears that the constraint with the logarithmic barrier and the k variable is ignored.

[image: formulation of the third optimisation problem, without the logarithmic constraint]

def problem_3(C):
    n, t = np.shape(C)
    
    y = cp.Variable(n)
    k = cp.Variable()
    b = np.array([1 / n] * n)
    
    obj =  cp.quad_form(y, C)
    constraints = [np.ones(n) @ y.T == 1, y >= 0]
    cp.Problem(cp.Minimize(obj), constraints).solve()
    
    return y.value

UPDATE_2

Here is the link to a sample input C: https://www.dropbox.com/s/kaa7voufzk5k9qt/matrix_.csv?dl=0. In this case the correct output for both problem_1 and problem_2 is approximately [0.0659 0.068 0.0371 0.1188 0.1647 0.3387 0.1315 0.0311 0.0441], since they are equivalent by definition. I am able to obtain the correct output only by solving problem_1. Solving problem_2 leads to [0.0227 0. 0. 0.3095 0.3392 0.3286 0. 0. 0. ], which is wrong; it happens to be the correct output for problem_3.
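For completeness, this is how I am loading the sample matrix and calling the functions (assuming the CSV is a plain comma-separated matrix with no header):

C = np.genfromtxt('matrix_.csv', delimiter=',')
print(problem_1(C))
print(problem_2(C))
print(problem_3(C))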


Get this bounty!!!

#StackBounty: #pandas #numpy Remove the requirement to loop through numpy array

Bounty: 100

Overview

The code below contains a numpy array clusters whose values are compared against each row of a pandas DataFrame using np.where. The SoFunc function returns the rows where all conditions are True and takes a single element of the clusters array as input.

Question

I can loop through this array to compare each array element against the respective np.where conditions. How do I remove the requirement to loop but still get the same output?

I appreciate that looping through numpy arrays is inefficient and want to improve this code. The actual dataset will be much larger.

Prepare the reproducible mock data

import numpy as np
import pandas as pd

def genMockDataFrame(days,startPrice,colName,startDate,seed=None):

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF


def SoFunc(clust):
    #generate mock data
    df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
    df["Upper_Band"] = 1.1928
    df.loc["2020-03-27"]["Upper_Band"] = 1.2118
    df.loc["2020-03-26"]["Upper_Band"] = 1.2200
    df["Level"] = np.where((df["High"] >= clust)
                                      & (df["Low"] <= clust)
                                     & (df["High"] >= df["Upper_Band"] ),1,np.NaN
                                      )
    return df.dropna()

Loop through the clusters array

clusters = np.array([1.1929   , 1.2118 ])

l = []

for i in range(len(clusters)):
    l.append(SoFunc(clusters[i]))
    
pd.concat(l)

Output

              Open  High    Low    Close    Upper_Band  Level
date                        
2020-03-19  1.1904  1.1937  1.1832  1.1832  1.1928      1.0
2020-03-25  1.1939  1.1939  1.1864  1.1936  1.1928      1.0
2020-03-27  1.2118  1.2144  1.2039  1.2089  1.2118      1.0
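
For context, this is roughly the broadcast shape I am aiming for instead of the loop (a sketch only; it flags a row once even if it matches several clusters, so it may not reproduce the concatenated output above exactly):

df = genMockDataFrame(10, 1.1904, 'eurusd', '19/3/2020', seed=157)
df["Upper_Band"] = 1.1928
df.loc["2020-03-27", "Upper_Band"] = 1.2118
df.loc["2020-03-26", "Upper_Band"] = 1.2200

high = df["High"].to_numpy()[:, None]    # shape (rows, 1)
low = df["Low"].to_numpy()[:, None]
upper = df["Upper_Band"].to_numpy()[:, None]

# (rows, n_clusters) boolean grid: row i compared against cluster j
mask = (high >= clusters) & (low <= clusters) & (high >= upper)
result = df[mask.any(axis=1)]            # rows matching at least one cluster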


Get this bounty!!!

#StackBounty: #python #pandas #dataframe #numpy If value of a column is less then 3 then replace value in another column with value fro…

Bounty: 50

I have a large dataframe (100,000 rows) with many columns. These are the relevant columns for my question:

id   herd        birth     H_BY  HYcount      death       H_DY   HYcount2
1    1345   2005-01-09    134505       1  2010-01-09    134510       1
2    1345   2010-03-05    134510       2  2015-01-09    134515       2
3    1345   2010-05-10    134510       2  2015-01-09    134515       2
4    1345   2011-06-01    134511       1  2016-01-09    134516       1
5    1345   2012-09-01    134512       1  2017-01-09    134517       2
6    1345   2015-09-13    134515       4  2017-01-09    134517       2
7    1346   2015-10-01    134615       4  2019-01-09    134619       1
8    1346   2015-10-27    134615       4  2020-01-09    134620       2
9    1346   2015-11-10    134615       4  2020-01-09    134620       2
10   1346   2016-12-10    134616       1  2021-01-09    134621       1

I am creating Herd-year fixed effects.
I have already combined the herd and birth/death columns into herd+birth year and herd+death year in separate columns, and counted how many times each fixed effect appears in the dataframe, as can be seen above.

However, now I want to check my whole dataframe for HYcount and HYcount2 values that are less than 3. So I don’t want any HY group that has only 1 or 2 members.

I would like to run through the dataframe and combine those HY groups that have 1 or 2 members into other groups, below or above.

EDIT

I also want to only combine HY groups WITHIN EACH HERD!

So I don’t want to add a member of one herd to another herd through the herd-year variable.

Here is what I’ve tried with the birth year fixed effect.

#Sort the df by the relevant value
df= df.sort_values(by=['H_BY'])


df.loc[
    (df['HYcount'] < 3),
    'H_BY'] = df['H_BY'].shift(-1)

#Count the values again 
df['HC1_c'] = df.groupby('H_BY')['H_BY'].transform('count')

But this is a very feeble attempt. I have to run through this many, many times to rid my dataframe of all values that are less than 3, and it does not work for record number 1. And I want to repeat this process over at least 4 other columns.

EDIT

And of course this code does not do anything about combining within one herd.
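
For what it is worth, here is a single-pass sketch of the same shift idea restricted to each herd (not a complete solution; the last record of each herd and the need for repeated passes are still unhandled):

#Sort within herd, then shift only within the herd
df = df.sort_values(by=['herd', 'H_BY'])
next_in_herd = df.groupby('herd')['H_BY'].shift(-1)

mask = (df['HYcount'] < 3) & next_in_herd.notna()
df.loc[mask, 'H_BY'] = next_in_herd[mask]

#Count the values again
df['HC1_c'] = df.groupby('H_BY')['H_BY'].transform('count')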

Any tips, tricks or ideas on how I can do this more efficiently?


Get this bounty!!!

#StackBounty: #python #performance #algorithm #python-3.x #numpy A Prime-Generating Algorithm and Python Script

Bounty: 100

I was wondering if it would be possible to optimise my Python script. It is designed to quickly generate and print all the prime numbers under some inputted number. It is fairly fast currently, and I am wondering about its time complexity; I am also curious to know whether there are further improvements and optimisations that could make it even faster. Additionally, I have been trying to implement njit with Numba, but I am getting rather nasty-looking errors, and I am wondering if anyone knows how to fix this, or whether implementing Numba is worthwhile to begin with. I have also been compiling to C with Nuitka, which has also produced a speed-up.

To summarise:

  • What is its current time complexity?
  • Are there any ways to improve it?
  • Could Numba be implemented, and if so, would it be worthwhile with a significant speed up?

I also appreciate any alternatives that are faster and/or are more memory-efficient (as long as ‘memory-efficient’ doesn’t cause a significant increase in completion time). I would like the final list of primes to be sorted if possible.

import numpy as np
import math

def primes(n):
    sieve = np.ones(n // 3 + (n % 6 == 2), dtype = bool)
    sieve[0] = False
    for i in range(math.isqrt(n) // 3 + 1):
        if sieve[i]:
            k = 3 * i + 1 | 1
            a = k * k
            b = 2 * k
            sieve[(a // 3) :: b] = False
            sieve[(a + b * (2 - (i & 1))) // 3 :: b] = False

    return np.r_[2, 3, ((3 * np.nonzero(sieve)[0] + 1) | 1)].tolist()

if __name__ == "__main__": 
    n = int(input("What value do you want to check up to? "))
    print(primes(n))
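
For reference, this is the rough timing harness I have been using while experimenting (not part of the script itself):

import timeit

# average of three runs for n = 10**7, using primes() as defined above
print(timeit.timeit("primes(10**7)", globals=globals(), number=3) / 3)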


Get this bounty!!!

#StackBounty: #python #pandas #numpy #rolling-computation #pearson-correlation How to tackle inconsistent results while using pandas ro…

Bounty: 50

Let me preface this by saying that, in order to reproduce the problem, I need a large amount of data; it is too large (~13k rows, 2 cols) to be pasted in the question, so I have added a pastebin link at the end of the post.


I have been facing a peculiar problem for the past few days with pandas.core.window.rolling.Rolling.corr. I have a dataset where I am trying to calculate rolling correlations. This is the problem:

While calculating rolling (window_size=100) correlations between two columns (a and b), some indices (one such index is 12981) give near-zero values (of order 1e-10), where the result should ideally be nan or inf (because all values in one column are constant). However, if I just calculate the standalone correlation pertaining to that index (i.e. the last 100 rows of data including the said index), or perform the rolling calculation on a smaller number of rows (e.g. 300 or 1000 as opposed to 13k), I get the correct result (i.e. nan or inf).

Expectation:

>>> df = pd.read_csv('sample_corr_data.csv') # link at the end,  ## columns = ['a', 'b']
>>> df.a.tail(100).value_counts()

 0.000000    86
-0.000029     3
 0.000029     3
-0.000029     2
 0.000029     2
-0.000029     2
 0.000029     2
Name: a, dtype: int64

>>> df.b.tail(100).value_counts()     # all 100 values are same
 
6.0    100
Name: b, dtype: int64

>>> df.a.tail(100).corr(df.b.tail(100))
nan                                      # expected, because column 'b' has same value throughout

# Made sure of this using,
# 1. np.corrcoef, because pandas uses this internally to calculate pearson moments
>>> np.corrcoef(df.a.tail(100), df.b.tail(100))[0, 1]
nan

# 2. using custom function
>>> def pearson(a, b):
        n = a.size
        num = n*np.nansum(a*b) - np.nansum(a)*np.nansum(b)
        den = (n*np.nansum((a**2)) - np.nansum(a)**2)*(n*np.nansum(b**2) - np.nansum(b)**2)
        return num/np.sqrt(den) if den * np.isfinite(den*num) else np.nan

>>> pearson(df.a.tail(100), df.b.tail(100))
nan

Now, the reality:

>>> df.a.rolling(100).corr(df.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981    2.755881e-10                    # This should have been NaN/inf !!

## Furthermore!!

>>> debug = df.tail(300)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981            -inf                    # Got -inf, fine
dtype: float64

>>> debug = df.tail(3000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981             inf                     # Got +inf, still acceptable
dtype: float64

This continues up to 9369 rows:

>>> debug = df.tail(9369)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             inf
dtype: float64

# then
>>> debug = df.tail(9370)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981    4.719615e-10                    # SPOOKY ACTION IN DISTANCE!!!
dtype: float64

>>> debug = df.tail(10000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981    1.198994e-10                    # SPOOKY ACTION IN DISTANCE!!!    
dtype: float64

Current Workaround

>>> df.a.rolling(100).apply(lambda x: x.corr(df.b.reindex(x.index))).tail(3)   # PREDICTABLY, VERY SLOW!

12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

# again this checks out using other methods,
>>> df.a.rolling(100).apply(lambda x: np.corrcoef(x, df.b.reindex(x.index))[0, 1]).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

>>> df.a.rolling(100).apply(lambda x: pearson(x, df.b.reindex(x.index))).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

As far as I understand, the result of series.rolling(n).corr(other_series) should match with the following:

>>> def rolling_corr(series, other_series, n=100):
        return pd.Series(
                    [np.nan]*(n-1) + [series[i-n:i].corr(other_series[i-n:i])
                    for i in range(n, series.size+1)]
        )

>>> rolling_corr(df.a, df.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             NaN
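
To see how widespread the problem is, one quick check (using the rolling_corr reference above) is to list every index where pandas returns a finite number but the reference does not, or vice versa:

>>> fast = df.a.rolling(100).corr(df.b)
>>> slow = rolling_corr(df.a, df.b)
>>> mismatch = np.isfinite(fast.to_numpy()) != np.isfinite(slow.to_numpy())
>>> fast.index[mismatch].tolist()        # indices where one side is finite and the other is not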

First I thought this was a floating-point arithmetic issue (because initially, in some cases, I could fix this by rounding column ‘a’ to 5 decimal places, or by casting to float32), but in that case it would be present irrespective of the number of samples used. So there must be some issue with rolling, or at least rolling gives rise to floating-point issues depending on the size of the data. I checked the source code of rolling.corr, but could not find anything that would explain such inconsistencies. And now I am worried about how much of my past code is plagued by this issue.

What is the reason behind this? And how do I fix it? If this is happening because pandas perhaps prefers speed over accuracy (as suggested here), does that mean I can never reliably use pandas.rolling operations on a large sample? How do I know the size beyond which this inconsistency will appear?


sample_corr_data.csv: https://pastebin.com/jXXHSv3r

Tested in

  • Windows 10, python 3.9.1, pandas 1.2.2, (IPython 7.20)
  • Windows 10, python 3.8.2, pandas 1.0.5, (IPython 7.19)
  • Ubuntu 20.04, python 3.7.7, pandas 1.0.5, (GCC 7.3.0, standard REPL)
  • CentOS Linux 7 (Core), Python 2.7.5, pandas 0.23.4, (IPython 5.8.0)

Note: Different OSes return different values at the said index, but all are finite and near zero.


Get this bounty!!!

#StackBounty: #python #pandas #dataframe #numpy Calculate how much of a trajectory/path falls in-between two other trajectories

Bounty: 50

In a broad sense, I’m trying to calculate how much of the red path/trajectory falls in-between the black paths for many different trials (see plot below).
I circled a couple of examples: for (0, 1, 3), approximately 30–40% of the red path falls in-between the two black paths, but for (2, 1, 3) only about 1–2% of the red path is in-between the two black paths.

[plot: red and black paths for each trial, with trials (0, 1, 3) and (2, 1, 3) circled]

I have two dataframes, df_R & df_H.

df_R contains the position data for the red paths (in X & Z).

Preview of df_R:

    (0, 1, 1)_mean_X  (0, 1, 1)_mean_Z  ...  (2, 2, 3)_mean_X  (2, 2, 3)_mean_Z
0         -15.856713          5.002617  ...        -15.600160         -5.010470
1         -15.831320          5.003529  ...        -15.566172         -5.012251
2         -15.805927          5.004441  ...        -15.532184         -5.014032
3         -15.780534          5.005353  ...        -15.498196         -5.015814
4         -15.755141          5.006265  ...        -15.464208         -5.017595
..               ...               ...  ...               ...               ...
95        -12.818362          5.429729  ...        -12.391177         -5.391595
96        -12.783905          5.437335  ...        -12.357563         -5.396919
97        -12.749456          5.444990  ...        -12.323950         -5.402243
98        -12.715017          5.452697  ...        -12.290336         -5.407567
99        -12.680594          5.460469  ...        -12.256722         -5.412891

df_H contains the position data for the black paths, which includes a ‘top’ and ‘bottom’ column for X and for Z, corresponding to the two black paths in each plot.

Preview of df_H:

    (0, 1, 1)_top_X  (0, 1, 1)_bottom_X  ...  (2, 2, 3)_top_Z  (2, 2, 3)_bottom_Z
0        -16.000000          -16.000000  ...        -5.000000           -5.000000
1        -16.000000          -16.000000  ...        -5.000000           -5.000000
2        -16.000000          -16.000000  ...        -5.000000           -5.000000
3        -16.000000          -16.000000  ...        -5.000000           -5.000000
4        -16.000000          -16.000000  ...        -5.000000           -5.000000
..              ...                 ...  ...              ...                 ...
95       -15.000971          -15.417215  ...        -4.993461           -5.011372
96       -14.979947          -15.402014  ...        -4.993399           -5.013007
97       -14.957949          -15.385840  ...        -4.993291           -5.014463
98       -14.934171          -15.368649  ...        -4.993186           -5.015692
99       -14.908484          -15.349371  ...        -4.993069           -5.016940

For each column in df_R, I need to see whether the X/Z value for that row is less than the top_X/Z and greater than the bottom X/Z value in df_H. If it is, then set that row = 1 in a new dataframe’s column, and if not, then = 0.

Then I need to check whether both X & Z met those conditions for that row, to see if the red path was in-between the two black paths in both dimensions.

I have been trying to implement this for a while but am stuck. This is what I’ve been trying but it’s not working and seems very inefficient:

import pandas as pd
import numpy as np

def CI_analysis(df_H, df_R):
    
    df_H_top_X = df_H.filter(regex='top_X')
    df_H_bottom_X = df_H.filter(regex='bottom_X')
    
    df_H_top_Z = df_H.filter(regex='top_Z')
    df_H_bottom_Z = df_H.filter(regex='bottom_Z')
    
    df_R_X = df_R.filter(regex='mean_X')
    df_R_Z = df_R.filter(regex='mean_Z')
    
    CI_inside_X = pd.DataFrame()
    c = 0
    for col in df_R_X:
        temp = []
        for i, val in df_R_X[col].iteritems():
            if (val < df_H_top_X.iloc[i,c]) & (val > df_H_bottom_X.iloc[i,c]):
                temp.append(1)
            else: 
                temp.append(0)
        CI_inside_X[col] = temp
        c = c+1
        
    CI_inside_Z = pd.DataFrame()
    c = 0
    for col in df_R_Z:
        temp = []
        for i, val in df_R_Z[col].iteritems():
            if (val < df_H_top_Z.iloc[i,c]) & (val > df_H_bottom_Z.iloc[i,c]):
                temp.append(1)
            else: 
                temp.append(0)
        CI_inside_Z[col] = temp
        c = c+1
    
    CI_inside = pd.DataFrame()
    c = 0
    for col in CI_inside_X:
        temp = []
        for i,row in CI_inside_X[col].iteritems(): 
            if (row == 1) & (CI_inside_Z.iloc[i,c] == 1):
                temp.append(1)
            else: 
                temp.append(0)
        CI_inside[col] = temp
        c = c+1
    
    CI_inside_avg = pd.DataFrame(CI_inside.mean(axis=0)).transpose() 
    
    return CI_inside_X, CI_inside_Z, CI_inside, CI_inside_avg  
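
For reference, this is roughly the vectorised shape I am aiming for (a sketch only, relying on the trial/column naming above; I have not verified that it reproduces the intended result):

def CI_analysis_sketch(df_H, df_R):
    # trial keys like '(0, 1, 1)', recovered from the mean_X column names
    trials = [c[:-len('_mean_X')] for c in df_R.columns if c.endswith('_mean_X')]

    CI_inside = pd.DataFrame(index=df_R.index)
    for t in trials:
        in_x = (df_R[t + '_mean_X'] < df_H[t + '_top_X']) & (df_R[t + '_mean_X'] > df_H[t + '_bottom_X'])
        in_z = (df_R[t + '_mean_Z'] < df_H[t + '_top_Z']) & (df_R[t + '_mean_Z'] > df_H[t + '_bottom_Z'])
        CI_inside[t] = (in_x & in_z).astype(int)

    CI_inside_avg = pd.DataFrame(CI_inside.mean(axis=0)).transpose()
    return CI_inside, CI_inside_avg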

Lastly, here is code to reproduce the two dataframes df_R & df_H (with random numbers):

df_R_cols = ['(0, 1, 1)_mean_X', '(0, 1, 1)_mean_Z', '(0, 1, 2)_mean_X',
       '(0, 1, 2)_mean_Z', '(0, 1, 3)_mean_X', '(0, 1, 3)_mean_Z',
       '(0, 2, 1)_mean_X', '(0, 2, 1)_mean_Z', '(0, 2, 2)_mean_X',
       '(0, 2, 2)_mean_Z', '(0, 2, 3)_mean_X', '(0, 2, 3)_mean_Z',
       '(1, 1, 1)_mean_X', '(1, 1, 1)_mean_Z', '(1, 1, 2)_mean_X',
       '(1, 1, 2)_mean_Z', '(1, 1, 3)_mean_X', '(1, 1, 3)_mean_Z',
       '(1, 2, 1)_mean_X', '(1, 2, 1)_mean_Z', '(1, 2, 2)_mean_X',
       '(1, 2, 2)_mean_Z', '(1, 2, 3)_mean_X', '(1, 2, 3)_mean_Z',
       '(2, 1, 1)_mean_X', '(2, 1, 1)_mean_Z', '(2, 1, 2)_mean_X',
       '(2, 1, 2)_mean_Z', '(2, 1, 3)_mean_X', '(2, 1, 3)_mean_Z',
       '(2, 2, 1)_mean_X', '(2, 2, 1)_mean_Z', '(2, 2, 2)_mean_X',
       '(2, 2, 2)_mean_Z', '(2, 2, 3)_mean_X', '(2, 2, 3)_mean_Z'] 

df_H_cols = ['(0, 1, 1)_top_X', '(0, 1, 1)_bottom_X', '(0, 1, 1)_top_Z',
       '(0, 1, 1)_bottom_Z', '(0, 1, 2)_top_X', '(0, 1, 2)_bottom_X',
       '(0, 1, 2)_top_Z', '(0, 1, 2)_bottom_Z', '(0, 1, 3)_top_X',
       '(0, 1, 3)_bottom_X', '(0, 1, 3)_top_Z', '(0, 1, 3)_bottom_Z',
       '(0, 2, 1)_top_X', '(0, 2, 1)_bottom_X', '(0, 2, 1)_top_Z',
       '(0, 2, 1)_bottom_Z', '(0, 2, 2)_top_X', '(0, 2, 2)_bottom_X',
       '(0, 2, 2)_top_Z', '(0, 2, 2)_bottom_Z', '(0, 2, 3)_top_X',
       '(0, 2, 3)_bottom_X', '(0, 2, 3)_top_Z', '(0, 2, 3)_bottom_Z',
       '(1, 1, 1)_top_X', '(1, 1, 1)_bottom_X', '(1, 1, 1)_top_Z',
       '(1, 1, 1)_bottom_Z', '(1, 1, 2)_top_X', '(1, 1, 2)_bottom_X',
       '(1, 1, 2)_top_Z', '(1, 1, 2)_bottom_Z', '(1, 1, 3)_top_X',
       '(1, 1, 3)_bottom_X', '(1, 1, 3)_top_Z', '(1, 1, 3)_bottom_Z',
       '(1, 2, 1)_top_X', '(1, 2, 1)_bottom_X', '(1, 2, 1)_top_Z',
       '(1, 2, 1)_bottom_Z', '(1, 2, 2)_top_X', '(1, 2, 2)_bottom_X',
       '(1, 2, 2)_top_Z', '(1, 2, 2)_bottom_Z', '(1, 2, 3)_top_X',
       '(1, 2, 3)_bottom_X', '(1, 2, 3)_top_Z', '(1, 2, 3)_bottom_Z',
       '(2, 1, 1)_top_X', '(2, 1, 1)_bottom_X', '(2, 1, 1)_top_Z',
       '(2, 1, 1)_bottom_Z', '(2, 1, 2)_top_X', '(2, 1, 2)_bottom_X',
       '(2, 1, 2)_top_Z', '(2, 1, 2)_bottom_Z', '(2, 1, 3)_top_X',
       '(2, 1, 3)_bottom_X', '(2, 1, 3)_top_Z', '(2, 1, 3)_bottom_Z',
       '(2, 2, 1)_top_X', '(2, 2, 1)_bottom_X', '(2, 2, 1)_top_Z',
       '(2, 2, 1)_bottom_Z', '(2, 2, 2)_top_X', '(2, 2, 2)_bottom_X',
       '(2, 2, 2)_top_Z', '(2, 2, 2)_bottom_Z', '(2, 2, 3)_top_X',
       '(2, 2, 3)_bottom_X', '(2, 2, 3)_top_Z', '(2, 2, 3)_bottom_Z']

df_R = pd.DataFrame(np.random.randint(0,100,size=(1000, 36)), columns=df_R_cols)
df_H = pd.DataFrame(np.random.randint(0,100,size=(1000, 72)), columns=df_H_cols)
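
With the random frames above, the sketch from earlier can then be exercised like this:

CI_inside, CI_inside_avg = CI_analysis_sketch(df_H, df_R)
print(CI_inside_avg)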


Get this bounty!!!

#StackBounty: #python #pandas #numpy #tensorflow Is it possible to pass a dataframe to TF/Keras that has a numpy array for each row?

Bounty: 50

I’m doing a regression that is working, but to improve results I wanted to add a numpy array as an extra feature (it represents user attributes that I preprocessed outside the application).

Here’s an example of my data:

MPG Cylinders   Displacement    Horsepower  Weight  Acceleration    Model Year  Origin  NumpyColumn
0   18.0    8   307.0   130.0   3504.0  12.0    70  1   [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1   15.0    8   350.0   165.0   3693.0  11.5    70  1   [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2   18.0    8   318.0   150.0   3436.0  11.0    70  1   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3   16.0    8   304.0   150.0   3433.0  12.0    70  1   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4   17.0    8   302.0   140.0   3449.0  10.5    70  1   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
... ... ... ... ... ... ... ... ... ...
393 27.0    4   140.0   86.0    2790.0  15.6    82  1   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
394 44.0    4   97.0    52.0    2130.0  24.6    82  2   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
395 32.0    4   135.0   84.0    2295.0  11.6    82  1   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
396 28.0    4   120.0   79.0    2625.0  18.6    82  1   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
397 31.0    4   119.0   82.0    2720.0  19.4    82  1   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Here’s how to generate it:

import numpy as np
import pandas as pd
import scipy.sparse as sparse

#download data
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(url, names=column_names,
          na_values='?', comment='t',
          sep=' ', skipinitialspace=True)

lenOfDF = (len(df))
#add numpy array
arr = sparse.coo_matrix(([1,1,1], ([0,1,2], [1,2,0])), shape=(lenOfDF,lenOfDF))
df['NumpyColumn'] = arr.toarray().tolist()
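
For reference, this is how the mixed column ends up as an object array when the frame is converted to numpy, which appears to be what the ValueError further down complains about:

X = df.drop(columns=['MPG']).to_numpy()
print(X.dtype)   # object, because NumpyColumn holds Python lists next to scalars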

Then my model is similar to this:

# assumed imports (tf.keras); Xtrain and opt are defined elsewhere in my code
from tensorflow.keras.layers import Input, Dense, Activation
from tensorflow.keras.models import Model

g_input = Input(shape=[Xtrain.shape[1]])
H1 = Dense(512)(g_input)
H1r = Activation('relu')(H1)
H2 = Dense(256)(H1r)
H2r = Activation('relu')(H2)
H3 = Dense(256)(H2r)
H3r = Activation('relu')(H3)
H4 = Dense(128)(H3r)
H4r = Activation('relu')(H4)
H5 = Dense(128)(H4r)

H5r = Activation('relu')(H5)
H6 = Dense(64)(H5r)
H6r = Activation('relu')(H6)
H7 = Dense(32)(H6r)
Hr = Activation('relu')(H7)
g_V = Dense(1)(Hr)

generator = Model(g_input,g_V)
generator.compile(loss='binary_crossentropy', optimizer=opt)

When I call it using the dataset with the NumpyColumn (x_batch is just a split and scaled subset of the above dataframe, with the numpy array passed through so it remains unchanged), I get the following error:

# generated = generator.predict(x_batch)                            #making prediction from the generator
generated = generator.predict(tf.convert_to_tensor(x_batch))      #making prediction from the generator

Error:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).

What am I doing wrong here? My thought is that having the array would provide the model with information to make better predictions, so I’m trying to test it. Is it possible to add a numpy array to a dataframe to train on? Or is there an alternative approach I should be using?
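
One alternative I have been considering (just a sketch against the mock data above, not verified) is to pull the list column out into its own 2D array and give the model two inputs, keeping the scalar features separate:

import numpy as np
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

data = df.dropna()
scalar_X = data.drop(columns=['MPG', 'NumpyColumn']).to_numpy(dtype='float32')
extra_X = np.array(data['NumpyColumn'].tolist(), dtype='float32')   # shape (rows, lenOfDF)
target = data['MPG'].to_numpy(dtype='float32')

inp_a = Input(shape=(scalar_X.shape[1],))
inp_b = Input(shape=(extra_X.shape[1],))
h = Dense(64, activation='relu')(Concatenate()([inp_a, inp_b]))
out = Dense(1)(h)

model = Model([inp_a, inp_b], out)
model.compile(loss='mse', optimizer='adam')
model.fit([scalar_X, extra_X], target, epochs=1)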

Many thanks in advance!

edit1: The above is a sample to quickly help you understand the problem. In my case, after encoding/scaling the dataframe, I have a numpy array that looks like this (it’s numeric values representing the categorical encodings, plus two numpy arrays at the end):

array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 9921.0,
       20.0, 0.40457918757980704, 0.11369258150627903, 0.868421052631579,
       0.47368421052631576, 0.894736842105263, 0.06688034531010473,
       0.16160188713280013, 0.7368421052631579, 0.1673332894736842,
       0.2099143206854345, 0.3690644464300929, 0.07097828135799109,
       0.8157894736842104, 0.9210526315789473, 0.23091420289239645,
       0.08623506024464939, 0.5789473684210527, 0.763157894736842, 0.0,
       0.18421052631578946, 0.07949239000059796, 0.18763907099960708,
       0.7368421052631579, 0.2668740256483197, 0.6842105263157894,
       0.13699219747488295, 0.868421052631579, 0.868421052631579,
       0.052631349139178094, 0.6842105263157894, 0.5526315789473684,
       0.6842105263157894, 0.6842105263157894, 0.6842105263157894,
       0.7105263157894737, 0.7105263157894737, 0.7105263157894737,
       0.23684210526315788, 0.0, 0.7105263157894737, 0.5789473684210527,
       0.763157894736842, 0.5263157894736842, 0.6578947368421052,
       0.6842105263157894, 0.7105263157894737, 0.0, 0.5789473684210527,
       0.2631578947368421, 0.6842105263157894, 0.6578947368421052,
       0.42105263157894735, 0.5789473684210527, 0.42105263157894735,
       0.7368421052631579, 0.7368421052631579, 0.15207999030227856,
       0.8445892232119124, 0.2683721567016762, 0.3142850329243405,
       0.18421052631578946, 0.19132292433056333, 0.20615136344079915,
       0.14475710664724623, 0.1624920232728424, 0.6989826700898587,
       0.18421052631578946, 0.21052631578947367, 0.4793448772543646,
       0.7894736842105263, 0.682967263567459, 0.37139592674256894,
       0.21123755190149363, 0.18421052631578946, 0.6578947368421052,
       0.39473684210526316, 0.631578947368421, 0.7894736842105263,
       0.36842105263157887, 0.1863353145721346, 0.7368421052631579,
       0.26809396092240706, 0.22492185003691062, 0.1460488284639197,
       0.631578947368421, 0.15347526114630458, 0.763157894736842,
       0.2097323620058104, 0.3684210526315789, 0.631578947368421,
       0.631578947368421, 0.631578947368421, 0.6842105263157894,
       0.36842105263157887, 0.10507952765043811, 0.22418515695024185,
       0.23755698619020282, 0.22226500126902, 0.530004040377794,
       0.3421052631578947, 0.19018711711349692, 0.19629244102133708,
       0.5789473684210527, 0.10526315789473684, 0.49999999999999994,
       0.5263157894736842, 0.5263157894736842, 0.49999999999999994,
       0.1052631578947368, 0.10526315789473678, 0.5263157894736842,
       0.4736842105263157, 2013.0,
       array([0.        , 0.        , 0.        , 0.62235785, 0.        ,
       0.27049118, 0.        , 0.31094068, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.4330532 , 0.        ,
       0.        , 0.2515796 , 0.        , 0.        , 0.        ,
       0.40683705, 0.01569915, 0.        , 0.        , 0.        ,
       0.13090582, 0.        , 0.49955425, 0.06970194, 0.29155406,
       0.        , 0.        , 0.27342197, 0.        , 0.        ,
       0.        , 0.04415211, 0.        , 0.03908829, 0.        ,
       0.07673171, 0.33199945, 0.        , 0.51759815, 0.        ,
       0.4719149 , 0.4538082 , 0.13475986, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.08000553,
       0.        , 0.02991109, 0.        , 0.5051543 , 0.        ,
       0.24663273, 0.        , 0.50839704, 0.        , 0.        ,
       0.05281948, 0.44884402, 0.        , 0.44542992, 0.15376966,
       0.        , 0.        , 0.        , 0.39128256, 0.49497205,
       0.        , 0.        ], dtype=float32),
       array([0.        , 0.        , 0.        , 0.62235785, 0.        ,
       0.27049118, 0.        , 0.31094068, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.4330532 , 0.        ,
       0.        , 0.25157961, 0.        , 0.        , 0.        ,
       0.40683705, 0.01569915, 0.        , 0.        , 0.        ,
       0.13090582, 0.        , 0.49955425, 0.06970194, 0.29155406,
       0.        , 0.        , 0.27342197, 0.        , 0.        ,
       0.        , 0.04415211, 0.        , 0.03908829, 0.        ,
       0.07673171, 0.33199945, 0.        , 0.51759815, 0.        ,
       0.47191489, 0.45380819, 0.13475986, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.08000553,
       0.        , 0.02991109, 0.        , 0.50515431, 0.        ,
       0.24663273, 0.        , 0.50839704, 0.        , 0.        ,
       0.05281948, 0.44884402, 0.        , 0.44542992, 0.15376966,
       0.        , 0.        , 0.        , 0.39128256, 0.49497205,
       0.        , 0.        ])], dtype=object)


Get this bounty!!!

#StackBounty: #python #numpy #genetic-algorithm numpy genetic algorithm selection: roulette wheel vs. stochastic universal sampling

Bounty: 250

I am implementing a genetic algorithm in numpy and I’m trying to figure out how to correctly implement selection via roulette wheel and stochastic universal sampling. The examples I’ve seen on stackoverflow or elsewhere use a python loop rather than vectorized numpy code.

For example, here are the implementations of both algorithms in DEAP.

def selRoulette(individuals, k, fit_attr="fitness"):
    """Select *k* individuals from the input *individuals* using *k*
    spins of a roulette. The selection is made by looking only at the first
    objective of each individual. The list returned contains references to
    the input *individuals*.
    :param individuals: A list of individuals to select from.
    :param k: The number of individuals to select.
    :param fit_attr: The attribute of individuals to use as selection criterion
    :returns: A list of selected individuals.
    This function uses the :func:`~random.random` function from the python base
    """
    s_inds = sorted(individuals, key=attrgetter(fit_attr), reverse=True)
    sum_fits = sum(getattr(ind, fit_attr).values[0] for ind in individuals)
    chosen = []
    for i in xrange(k):
        u = random.random() * sum_fits
        sum_ = 0
        for ind in s_inds:
            sum_ += getattr(ind, fit_attr).values[0]
            if sum_ > u:
                chosen.append(ind)
                break

    return chosen

def selStochasticUniversalSampling(individuals, k, fit_attr="fitness"):
    """Select the *k* individuals among the input *individuals*.
    The selection is made by using a single random value to sample all of the
    individuals by choosing them at evenly spaced intervals. The list returned
    contains references to the input *individuals*.
    :param individuals: A list of individuals to select from.
    :param k: The number of individuals to select.
    :param fit_attr: The attribute of individuals to use as selection criterion
    :return: A list of selected individuals.
    """
    s_inds = sorted(individuals, key=attrgetter(fit_attr), reverse=True)
    sum_fits = sum(getattr(ind, fit_attr).values[0] for ind in individuals)

    distance = sum_fits / float(k)
    start = random.uniform(0, distance)
    points = [start + i*distance for i in xrange(k)]

    chosen = []
    for p in points:
        i = 0
        sum_ = getattr(s_inds[i], fit_attr).values[0]
        while sum_ < p:
            i += 1
            sum_ += getattr(s_inds[i], fit_attr).values[0]
        chosen.append(s_inds[i])

    return chosen

Here is my implementation of roulette wheel, which seems to be weighted sampling with replacement, but I’m not sure about the replacement parameter.

# population is a 2D array of integers
# population_fitness is a 1D array of float of same length as population

weights = population_fitness / population_fitness.sum()
selected = population[np.random.choice(len(population), size=n, replace=True, p=weights)]       
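
To convince myself that this matches selRoulette, I have been comparing empirical selection frequencies against the normalised fitnesses on made-up data (a rough check, not a proof of equivalence):

import numpy as np

# hypothetical toy data purely for the check
population = np.arange(10).reshape(10, 1)
population_fitness = np.random.random(10)

weights = population_fitness / population_fitness.sum()
picks = np.random.choice(len(population), size=100_000, replace=True, p=weights)
print(np.bincount(picks, minlength=10) / 100_000)   # should be close to weights
print(weights)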

And here is my implementation of SUS selection. Am I correct that, when implemented in numpy, the only thing I have to change is that sampling is without replacement, or should I also remove the weights?

weights = population_fitness / population_fitness.sum()
selected = population[np.random.choice(len(population), size=n, replace=False, p=weights)]       
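
For comparison, this is what I think a direct vectorised translation of the DEAP SUS loop would look like (a sketch I have not verified):

def sus_numpy(population, population_fitness, n):
    order = np.argsort(population_fitness)[::-1]     # descending fitness, as in selStochasticUniversalSampling
    cum = np.cumsum(population_fitness[order])
    distance = cum[-1] / n
    points = np.random.uniform(0, distance) + distance * np.arange(n)
    idx = np.searchsorted(cum, points)               # smallest index whose running fitness sum reaches each point
    return population[order[idx]]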

Thanks for any suggestions!


Get this bounty!!!