#StackBounty: #statistical-significance #python #p-value #kolmogorov-smirnov #scipy ks_2samp test in Python scipy – low D statistic, low p-value

Bounty: 50

As the heading says, I’m getting both a low D statistic and a low p-value from the ks_2samp test. More specifically:

Ks_2sampResult(statistic=0.049890046265079313, pvalue=0.0011365796735152277)

These two results seem contradictory to me. If the absolute difference between the two CDFs is only 0.05, I would say they are mostly the same distribution, and it’s quite unintuitive and strange to see such a low p-value.

The sample sizes for both of my variables are over 1500. Both of them have range [0, 1]. Now, I have found this post.

It seems that the p-value and D statistic both decrease as the sample size increases. This raises concerns for me about using this method to test whether two distributions are the same, and I am now fairly convinced it should not be trusted in my case. I would like to hear more opinions: if it really is misleading here, why should I trust it in any case?
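
For illustration, the effect is easy to reproduce with synthetic data (normal samples with a small fixed shift, not the data from the question): the D statistic stays small while the p-value collapses as the sample size grows.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
for n in (100, 1500, 20000):
    a = rng.normal(0.0, 1.0, n)  # reference sample
    b = rng.normal(0.1, 1.0, n)  # small, fixed location shift
    d, p = ks_2samp(a, b)
    print(n, round(d, 3), p)     # D stays small; p collapses as n grows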


Get this bounty!!!

Distributed Evolutionary Algorithms in Python

DEAP

DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent. It works in perfect harmony with parallelization mechanisms such as multiprocessing and SCOOP.

DEAP includes the following features:

  • Genetic algorithm using any imaginable representation
    • List, Array, Set, Dictionary, Tree, Numpy Array, etc.
  • Genetic programming using prefix trees
    • Loosely typed, Strongly typed
    • Automatically defined functions
  • Evolution strategies (including CMA-ES)
  • Multi-objective optimisation (NSGA-II, SPEA2, MO-CMA-ES)
  • Co-evolution (cooperative and competitive) of multiple populations
  • Parallelization of the evaluations (and more)
  • Hall of Fame of the best individuals that lived in the population
  • Checkpoints that take snapshots of a system regularly
  • Benchmarks module containing most common test functions
  • Genealogy of an evolution (that is compatible with NetworkX)
  • Examples of alternative algorithms: Particle Swarm Optimization, Differential Evolution, Estimation of Distribution Algorithm

See the DEAP User’s Guide for DEAP documentation.
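
As a quick taste of the API, here is a compact sketch along the lines of the OneMax example in the DEAP documentation (maximize the number of ones in a 100-bit string); the parameter values are illustrative only.

import random
from deap import algorithms, base, creator, tools

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_bool, 100)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", lambda ind: (sum(ind),))  # fitness must be a tuple
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=300)
pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2,
                               ngen=40, verbose=False)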

Installation

We encourage you to use easy_install or pip to install DEAP on your system. Other installation procedures like apt-get, yum, etc. usually provide an outdated version.

pip install deap

The latest version can be installed with

pip install git+https://github.com/DEAP/deap@master

If you wish to build from sources, download or clone the repository and type

python setup.py install

 

Source

#StackBounty: #python #numpy #matplotlib #scipy #statistics scipy.stats.binned_statistic_2d works for count but not mean

Bounty: 50

I have some satellite data which looks like the following (scatter plot):

Night-time Ion Density

I now want to bin this data into a regular grid over time and latitude and have each bin be equal to the mean of all the data points that fall within it. I have been experimenting with scipy.stats.binned_statistic_2d and am baffled at the results I am getting.

First, if I pass the “count” statistic into the scipy binning function, it appears to work correctly (minimal code and plot below).

id1 = np.ma.masked_where(id1==0, id1)  # id1 is the actual data; I have tried with and without this masking, to the same effect

x_range = np.arange(0,24.25,.25) #setting grid spacing for x and y
y_range = np.arange(-13,14,1)

xbins, ybins = len(x_range), len(y_range) #number of bins in each dimension

H, xedges, yedges, binnumber = stats.binned_statistic_2d(idtime, idlat, values = id1, statistic='count' , bins = [xbins, ybins])  #idtime and idlat are the locations of each id1 value in time and latitude
H = np.ma.masked_where(H==0, H) #masking where there was no data
XX, YY = np.meshgrid(xedges, yedges)

fig = plt.figure(figsize = (13,7))
ax1=plt.subplot(111)
plot1 = ax1.pcolormesh(XX,YY,H.T)

Resulting Plot

Counts

Now if I change the statistic to 'mean', np.mean, np.ma.mean, etc., this is the plot I get, which appears only to pick out where there is data and where there is none:

Mean

This happens even though the min and max values of the data are 612 and 2237026 respectively. I have written some code that does the binning manually, but it isn’t pretty and takes forever (and I haven’t completely accounted for edge effects, so running it until it errors and then fixing it is also taking forever).

I would love some advice to get this to work. Thanks!

Edit: I just noticed that I am getting a runtime warning after running the script, which I can’t find any information about online; a Google search for the warning returns zero results. The warning occurs for every statistic option except count.

AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\matplotlib\colors.py:494:
RuntimeWarning: invalid value encountered in less
  cbook._putmask(xa, xa < 0.0, -1)
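
For what it’s worth, the warning is consistent with empty bins: with statistic='count' an empty bin is 0, but with 'mean' it is NaN, and NaN comparisons are what trip matplotlib’s color normalization. A tiny sketch with synthetic points (not the satellite data):

from scipy import stats

x = [0.1, 0.9]
y = [0.1, 0.9]
r = stats.binned_statistic_2d(x, y, values=[1.0, 3.0],
                              statistic='mean', bins=[2, 2])
print(r.statistic)  # the two empty bins are NaN, not 0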

Edit 2: I am attaching some code below that duplicates my problem. It works for the count statistic but not for mean or any other statistic, and it produces the same runtime warning as before.

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

x = np.random.rand(1000)
y = np.random.rand(1000)

z = np.arange(1000)

H, xedges, yedges, binnumber = stats.binned_statistic_2d(x, y, values = z, statistic='count' , bins = [20, 20])
H2, xedges2, yedges2, binnumber2 = stats.binned_statistic_2d(x, y, values = z, statistic='mean' , bins = [20, 20])

XX, YY = np.meshgrid(xedges, yedges)
XX2, YY2 = np.meshgrid(xedges2, yedges2)

fig = plt.figure(figsize = (13,7))
ax1=plt.subplot(111)
plot1 = ax1.pcolormesh(XX,YY,H.T)
cbar = plt.colorbar(plot1,ax=ax1, pad = .015, aspect=10)
plt.show()

fig2 = plt.figure(figsize = (13,7))
ax2=plt.subplot(111)
plot2 = ax2.pcolormesh(XX2,YY2,H2.T)
cbar = plt.colorbar(plot2,ax=ax2, pad = .015, aspect=10)
plt.show()

count_working_code
mean_working_code

Edit 3: User8153 was able to identify the problem. The solution was to mask the array returned by scipy.stats where NaNs occur; I used np.ma.masked_invalid() to do this. Plots of my original data and test data are below for the mean statistic.

Working Mean My Data
Working Mean Sample Data
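
For reference, a minimal version of the fix applied to the sample code from Edit 2 is a single masking line before plotting:

H2 = np.ma.masked_invalid(H2)  # mask the NaN (empty) bins

fig2 = plt.figure(figsize=(13, 7))
ax2 = plt.subplot(111)
plot2 = ax2.pcolormesh(XX2, YY2, H2.T)
cbar = plt.colorbar(plot2, ax=ax2, pad=.015, aspect=10)
plt.show()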


Get this bounty!!!

#StackBounty: #python-2.7 #scipy #mathematical-optimization Setting up scipy optimize for equality in dual solution

Bounty: 200

I have a data as:

data = [1.55, 13, 19, 19, 5.5]

for which I need to find the multipliers xi for i = 0, 1, 2, 3, 4

such that:

data2 = [x0*1.55, x1*13, x2*19, x3*19, x4*5.5]
(data2[0] + .. + data2[4])/len(data2) = 11.9 #--constraint 1
( MAX(data2[0], 13) + ... + MAX(data2[4], 13) )/len(data2) = 2.6 #--constraint 2
x0 > 0; x1 > 0; ..., x4 > 0 #--constraint 3
x0 + x1 + x2 + x3 + x4 = 3 #--constraint 4
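
For concreteness, here is a primal feasibility sketch with scipy.optimize.minimize (SLSQP), taking the constraints exactly as written: the objective is constant (we only search for a feasible point), np.maximum is not smooth so SLSQP may struggle near the kink, and constraint 2 as stated looks infeasible (a mean of values that are each at least 13 cannot equal 2.6), so the numbers may need revisiting.

import numpy as np
from scipy.optimize import minimize

data = np.array([1.55, 13, 19, 19, 5.5])

cons = [
    {'type': 'eq', 'fun': lambda x: (x * data).mean() - 11.9},               # constraint 1
    {'type': 'eq', 'fun': lambda x: np.maximum(x * data, 13).mean() - 2.6},  # constraint 2, as stated
    {'type': 'eq', 'fun': lambda x: x.sum() - 3},                            # constraint 4
]
bounds = [(1e-9, None)] * len(data)  # constraint 3: x_i > 0

res = minimize(lambda x: 0.0, x0=np.full(len(data), 0.6),
               method='SLSQP', bounds=bounds, constraints=cons)
print(res.x, res.success)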

I am trying to implement the dual solution available in this paper

Following is an excerpt from the paper which I am failing to implement:
[image: excerpt from the paper]

I am able to set up the primal solution, but because of the way H and x are defined I am not able to set up this particular implementation from the paper: H depends on constraint 2, which has x in it, and x in turn depends on H through equation 90.

Edit:
Edited the problem based on comments below.

Edit2:
Please feel free to set up the dual problem in your way, if you think my interpretation is incorrect. My only goal is to implement the dual for my particular optimization.


Get this bounty!!!

#StackBounty: #python #numerical-methods #scipy #sympy Fitting multiple piecewise functions to data and return functions and derivatives as Fortran code

Bounty: 50

Background

For a future workshop I’ll have to fit arbitrary functions (independent variable is height z) to data from multiple sources (output of different numerical weather prediction models) in a yet unknown format (but basically gridded height/value pairs). The functions only have to interpolate the data and be differentiable. There should explicitly be no theoretical background for the type of function, but they should be smooth. The goal is to use the gridded (meaning discrete) output of the numerical weather prediction model in our pollutant dispersion model, which requires continuous functions.

Workflow

  1. choose the input model
  2. load input data
  3. define list of variables (not necessarily always the same)
  4. define height ranges (for the piecewise function)
  5. define base functions like “a0 + a1*z” for each height range and variable
  6. optionally define weights, because some parts are more important than others
  7. fit the piecewise functions
  8. save the fitted functions and their derivatives as Fortran 90 free form source code (to be included in our model)

I don’t think 1.-6. can be automated, but the rest should be.

Code

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from sympy import log, ln, Piecewise, lambdify, symbols, sympify, fcode, sqrt


def config(name):
    """Configuration of the piecewise function fitting, dependent on input name

    Input:
    name... name of experiment to fit data to, basically chooses settings
    Output:
    var_list... list of variables to fit
    infunc_list_dict... dictionary with var_list as keys, each having a list as
        value that contains strings with the sub-function to fit, from
        the bottom up. Only the first (lowest) may have a constant value, all
        others must be 0 at the height they "take over" (where their argument
        is 0). There, the value of the lower, fitted function is added to
        ensure continuity. The parameters for each function HAVE to be of the
        pattern "aX", where "X" is numerically increasing (0, 1, 2...) within
        each sub-function.
        The arguments of aloft functions (not the bottom most) are usually
        "z - t", unless there is some trickery with "s"
        A constant, first sub-function is 'a0', while constant sub-function
        aloft has to be '0' for technical reasons.
        Variables replaced by values:
            - t... current threshold height
            - s... transition value at height t
            - zi.. bounday layer height
    thresh_list_dict... dictionary with var_list as keys, each having a list as
        value that contains the height where the piecewise functions change.
        for technical reasons the ground (0) and the top (np.inf) are also
        included.
    weight_list_dict... dictionary with var_list as keys, each having a list as
        value that contains relative weights (to 1) that are used to force the
        fitting to be closer to the real value at crucial points. This is
        around the threshold heights, at the ground and at the ABL. To "turn
        off" a weight, set it to 1. The first weight is at the ground and then
        there are two around each threshold height and the last at the top.
        i.e: [ground,
            lower-of-thresh0, upper-of-thresh0,
            lower-of-thresh1, upper-of-thresh1,
            ...
            top]
        the first function uses ground and lower-of-thresh0,
        the second uses upper-of-thresh0 and  lower-of-thresh1 until
        the last uses lower-of-threshI and top
    wefact_list_dict... analog to weight_list_dict, except that it contains
        the relative distance where the weight in weight_list_dict is applied.
        Relative distance means here: fraction of the total subrange. Typical
        values are 0.1 or 0.2, meaning 10 or 20% of the total subrange take the
        accompanying weight. If the corresponding weight equals 1, the value
        has no influence.
    teston... True: create plots; False: don't
    saveon... True: don't show plots, save them as pdfs (only if teston==True).
    printon... True: print output to console; False: don't
    """
    teston = True
    saveon = False
    printon = False

    # ========= TMP220 =========
    if name == 'tmp220':
        abl_height = 990
        var_list = ['um', 'u2', 'v2', 'w2', 'w3', 'uw', 'eps']
        infunc_list_dict = {
            'um': ['a0*ln(z-t)**3 + a1*ln(z-t)**2 + a2*ln(z-t) + a3'],
            'u2': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2'],
            'v2': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2'],
            'w2': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2'],
            'w3': ['a0',
                '0'],
            'uw': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2 + a2*(z-t)**3 + a3*(z-t)**4'],
            'eps': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                    'a0*(z-t)**a1 + a2*(z-t)**3 + a3*(z-t)**2 + a4*(z-t)**4 + a5*(z-t)**6'],
            }
        thresh_list_dict = {
            'um': [0.0, np.inf],
            'u2': [0.0, 12.5, np.inf],
            'v2': [0.0, 12.5, np.inf],
            'w2': [0.0, 12.5, np.inf],
            'w3': [0.0, 12.5, np.inf],
            'uw': [0.0, 12.5, np.inf],
            'eps': [0.0, 12.5, np.inf],
            }
        weight_list_dict = {
            'um': [100, 1],
            'u2': [100, 5000, 1, 1],
            'v2': [100, 5000, 1, 1],
            'w2': [100, 5000, 1, 1],
            'w3': [100, 5000, 1, 1],
            'uw': [100, 5000, 1, 1],
            'eps': [100, 5000, 1, 1],
            }
        wefact_list_dict = {
            'um': [0.2, 0.1],
            'u2': [0.2, 0.2, 0.1, 0.1],
            'v2': [0.2, 0.2, 0.1, 0.1],
            'w2': [0.2, 0.2, 0.1, 0.1],
            'w3': [0.2, 0.2, 0.1, 0.1],
            'uw': [0.2, 0.2, 0.1, 0.1],
            'eps': [0.2, 0.2, 0.1, 0.1],
            }
    #elif name == 'SOMETHING ELSE': analog to above, omitted for brevity
    else:
        raise ValueError('Unsupported name, configure in config()')

    return (var_list, abl_height, infunc_list_dict, thresh_list_dict,
            weight_list_dict, wefact_list_dict, teston, saveon, printon)


def read_scm_data(name_str):
    """This routines reads in the profiles from the SCMs

    Input: # TODO (depends on their format), for now dummy data
    Output: dataframe: z, u2, v2, w2, w3, uw, um, eps
    """
    # TODO: add actual read routine, this is just dummy input
    if name_str == 'tmp220':
        out = pd.read_csv('tmp220.csv', delimiter=',')
    #elif name_str == 'SOMETHING ELSE': as above, omitted for brevity
    else:
        raise ValueError('Unknown name, configure in read_scm_data()')
    return out


def test_fit(name, var_list, func_dict, data, saveon):
    """plot of data vs fitted functions
    """
    # Omitted for brevity, not that relevant


def fit_func(var, abl_height, data_z, data_v, infunc_str_list,
            thresh_list, weight_list, wefact_list):
    """Converts the piecewise defined functions in infunc_str_list with the
    thresholds in thresh_list (and the weights defined by weight_list and
    wefact_list) to a SymPy expression and fits it to (data_z, data_v), where
    data_z is height and data_v are the values in each height. Returns the
    piecewise SymPy function with substituted parameters.
    """
    z = symbols('z')
    y_list = []  # holds the subfunctions
    niterations = 20000
    # transition_value holds the value that is added to each sub-function
    # to ensure a continuous function. this is obviously 0 for the first
    # subfunction and equal to the value of the previous sub-function at the
    # threshold height for each subsequent sub-function.
    transition_value = 0

    # for each piece of the function:
    for i, func_str in enumerate(infunc_str_list):
        # find number of parameters and create those SymPy objects
        nparams = func_str.count('a')
        a = symbols('a0:%d' % nparams)
        t = symbols('t')  # transition height
        s = symbols('s')  # transition value
        zi = symbols('zi')  # boundary layer height

        # check the string and create the sympy expression
        verify_func_str(var, func_str)
        y_list.append(sympify(func_str))

        # add the transition value and substitute the placeholder variables:
        y_list[i] += transition_value
        y_list[i] = y_list[i].subs(t, thresh_list[i])
        y_list[i] = y_list[i].subs(s, transition_value)
        y_list[i] = y_list[i].subs(zi, abl_height)

        # lambdify the sympy-expression with a somewhat ugly hack:
        t = [z]
        for j in range(nparams):
            t.append(a[j])
        func = lambdify(tuple(t), y_list[i], modules=np)

        # create the correct subset of the data
        local_index = (data_z > thresh_list[i]) & (data_z < thresh_list[i + 1])
        local_z = data_z[local_index]
        local_v = data_v[local_index]

        # create the weight arrays. they have the same size as the local_z and
        # are 1 everywhere except the range defined with wefact, where they
        # are the specified weight. see config() for definitions.
        weight = np.ones_like(local_z)
        z_range = local_z[-1] - local_z[0]
        lower_weight_lim = local_z[0] + wefact_list[2*i] * z_range
        upper_weight_lim = local_z[-1] - wefact_list[2*i + 1] * z_range
        weight[local_z < lower_weight_lim] = weight_list[2*i]
        weight[local_z > upper_weight_lim] = weight_list[2*i + 1]
        sigma = 1. / weight

        # fit the function to the data, checking for constant function aloft:
        if nparams > 0:
            popt, pcov = curve_fit(func, local_z, local_v, sigma=sigma,
                                maxfev=niterations)

        # substitute fitted parameters in sympy expression:
        for j in range(nparams):
            y_list[i] = y_list[i].subs(a[j], popt[j])

        # calculate the new transition_value:
        if nparams > 0:
            transition_value = func(thresh_list[i + 1], *popt)
        else:
            transition_value = func(thresh_list[i + 1])

    # After all sub-functions are fitted, combine them to a piecewise function.
    # This is a terrible hack, but I couldn't find out how to create piecewise
    # functions dynamically...
    if len(y_list) == 1:
        y = y_list[0]
    elif len(y_list) == 2:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], True))
    elif len(y_list) == 3:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], z <= thresh_list[2]),
                    (y_list[2], True))
    elif len(y_list) == 4:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], z <= thresh_list[2]),
                    (y_list[2], z <= thresh_list[3]),
                    (y_list[3], True))
    elif len(y_list) == 5:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], z <= thresh_list[2]),
                    (y_list[2], z <= thresh_list[3]),
                    (y_list[3], z <= thresh_list[4]),
                    (y_list[4], True))
    else:
        raise ValueError('More than five sub-functions not implemented yet')
    return y


def create_deriv(funcname, func):
    """Creates the derivative of the function, taking into account that v2 has
    two "derivatives".
    careful: returns tuple of two functions if funcname==v2, else one function
    """
    z = symbols('z')
    if funcname != 'v2':
        return func.diff(z)
    else:
        deriv = func.diff(z)
        deriv_sig = sqrt(func).diff(z)
        return (deriv, deriv_sig)


def verify_input(name, var, infunc_list, thresh_list,
                weight_list, wefact_list):
    """rudimentary checks if the functions, weights and thresholds are faulty
    """
    nfuncs = len(infunc_list)
    if len(thresh_list) != nfuncs + 1:
        raise ValueError('Number of functions and thresholds disagree for ' +
                        var + ' of ' + name)
    if len(weight_list) != nfuncs * 2:
        raise ValueError('Number of functions and weights disagree for ' +
                        var + ' of ' + name)
    if len(wefact_list) != nfuncs * 2:
        raise ValueError('Number of functions and weight factors disagree' +
                        ' for ' + var + ' of ' + name)


def verify_func_str(var, func_str):
    """Checks if the function string has linearly increasing parameters,
    starting with 0 (i.e. "a0, a1, a2..."), because otherwise there is only
    a cryptic error in minpack.
    """
    index_list = []
    for c, char in enumerate(func_str):
        if char == 'a':
            index_list.append(int(func_str[c+1]))
    if list(range(0, len(index_list))) != index_list:
        raise ValueError(func_str + ' has non-monotonically increasing' +
                        ' parameter indices or does not start with a0' +
                        ' in variable ' + var)


def main(name, var_list, abl_height, infunc_list_dict, thresh_list_dict,
        weight_list_dict, wefact_list_dict, teston, saveon, printon):
    """Start routines, print output (if printon), save Fortran functions in a
    file and start testing (if teston)
    """
    func_dict, deri_dict = {}, {}
    # file_str stores everything that is written to file in one string
    file_str = '!' + 78*'=' + '\n' + '!     ' + name + '\n!' + 78*'=' + '\n'
    if printon:
        print(' ' + 78*'_' + ' ')
        print('|' + 78*' ' + '|')
        print('|' + 30*' ' + name + (48-len(name))*' ' + '|')
        print('|' + 78*' ' + '|')
    data = read_scm_data(name)
    for var in var_list:
        verify_input(name, var, infunc_list_dict[var], thresh_list_dict[var],
                    weight_list_dict[var], wefact_list_dict[var])
        if printon:
            print('! ----- ' + var)
        file_str += '! ----- ' + var + '\n'
        # use data.z.values to get rid of the pandas-overhead and because
        # some stuff is apparently not possible otherwise (like data.z[-1])
        func_dict[var] = fit_func(var, abl_height, data.z.values,
                                data[var].values, infunc_list_dict[var],
                                thresh_list_dict[var], weight_list_dict[var],
                                wefact_list_dict[var])
        func_fstr = fcode(func_dict[var], source_format='free', assign_to=var,
                        standard=95)
        if printon:
            print(func_fstr)
        file_str += func_fstr + '\n'
        if var != 'v2':
            deri_dict[var] = create_deriv(var, func_dict[var])
            deri_fstr = fcode(deri_dict[var], source_format='free',
                            assign_to='d'+var)
            if printon:
                print(deri_fstr)
            file_str += deri_fstr + '\n\n'
        else:
            deri_dict[var], deri_dict['sigv'] = create_deriv(var,
                                                            func_dict[var])
            deri_fstr = fcode(deri_dict[var], source_format='free',
                            assign_to='d'+var, standard=95)
            deri2_fstr = fcode(deri_dict['sigv'], source_format='free',
                            assign_to='dsigv', standard=95)
            file_str += deri_fstr + '\n'
            file_str += deri2_fstr + '\n\n'
            if printon:
                print(deri_fstr)
                print(deri2_fstr)
        if printon:
            print('')
    if printon:
        print('|' + 78*'_' + '|\n')
    file_str = file_str + '\n\n'  # end with newlines
    if teston:
        test_fit(name, var_list, func_dict, data, saveon)

    # save fortran functions in file:
    filename = name + '_turbparas.inc'
    with open(filename, 'w') as f:
        f.write(file_str)


if __name__ == '__main__':
    name = 'tmp220'
    main(name, *config(name))

The code is currently not runnable, because the input file is missing. I couldn’t find a canonical way to upload data, please advise. It’s currently a 160 kB .csv file.

The code runs and does what I want, but I don’t have any formal training in programming and I’m sure it could be improved. Speed is not a huge issue (really depends on how complicated the functions to be fit are), but reliability and adaptability are.

Some points I know are “wrong”:

  • too many comments that explain what the code does (I like those, because I forget stuff)
  • the input strings for the base sub-functions are too long (>80) and lack spaces around *. It’s a compromise.
  • if a height-range has fewer data points than the corresponding subfunction has parameters, the code halts with a helpful minpack error message.

Some details I’d like to be different but also know to be impossible without changing sympy:

  • Fortran output with 4 instead of 3 spaces
  • Fortran output with 132 line length instead of 80.
  • A dynamic way to combine the pieces to one piecewise function to avoid those if conditions at the end of ‘fit_func’. (maybe that is possible? see the sketch below)
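
On the last point: a dynamic construction is in fact possible without changing sympy, because Piecewise accepts an unpacked list of (expression, condition) pairs. A sketch using the names local to fit_func (y_list, thresh_list, z), which would replace the whole if/elif chain:

pieces = [(expr, z <= th)
          for expr, th in zip(y_list[:-1], thresh_list[1:-1])]
pieces.append((y_list[-1], True))
y = y_list[0] if len(y_list) == 1 else Piecewise(*pieces)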


Get this bounty!!!
