#HackerRank: Computing the Correlation

Problem

You are given the scores of N students in three different subjects – Mathematics, Physics and Chemistry – all of which have been graded on a scale of 0 to 100. Your task is to compute the Pearson product-moment correlation coefficient between the scores of different pairs of subjects (Mathematics and Physics, Physics and Chemistry, Mathematics and Chemistry) based on this data. This data is based on the records of the CBSE K-12 Examination – a national school-leaving examination in India – for the year 2013.

Pearson product-moment correlation coefficient

This is a measure of linear correlation, described well on this Wikipedia page. The formula, in brief, is given by:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σᵢ (xᵢ − x̄)²) · √(Σᵢ (yᵢ − ȳ)²) )

where x and y denote the two vectors between which the correlation is to be measured, and x̄ and ȳ denote their means.

Input Format

The first row contains an integer N.
This is followed by N rows, each containing three tab ('\t') separated integers M, P, C, corresponding to a candidate's scores in Mathematics, Physics and Chemistry respectively.
Each row corresponds to the scores attained by a unique candidate in these three subjects.

Input Constraints

1 <= N <= 5 × 10^5
0 <= M, P, C <= 100

Output Format

The output should contain three lines, each with a correlation coefficient rounded to exactly 2 decimal places.
The first line should contain the correlation coefficient between Mathematics and Physics scores.
The second line should contain the correlation coefficient between Physics and Chemistry scores.
The third line should contain the correlation coefficient between Chemistry and Mathematics scores.

So, your output should look like this (these values are only for explanatory purposes):

0.12
0.13
0.95

Test Cases

There is one sample test case with scores obtained in Mathematics, Physics and Chemistry by 20 students. The hidden test case contains the scores obtained by all the candidates who appeared for the examination and took all three tests (Mathematics, Physics and Chemistry).
Think: How can you efficiently compute the correlation coefficients within the given time constraints, while handling the scores of nearly 400k students?

Sample Input

20
73  72  76
48  67  76
95  92  95
95  95  96
33  59  79
47  58  74
98  95  97
91  94  97
95  84  90
93  83  90
70  70  78
85  79  91
33  67  76
47  73  90
95  87  95
84  86  95
43  63  75
95  92  100
54  80  87
72  76  90

Sample Output

0.89  
0.92  
0.81

There is no special library support available for this challenge.

Solution (Source)
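The solution link above is external; as a sketch (mine, not the linked solution), each coefficient can be computed in O(N) from accumulated sums, using the equivalent single-pass form r = (n·Σxy − Σx·Σy) / √((n·Σx² − (Σx)²)·(n·Σy² − (Σy)²)):

import math
import sys

def pearson(x, y):
    """Pearson r between two equal-length integer lists, single pass."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy))

def main():
    data = sys.stdin.read().split()
    n = int(data[0])
    scores = list(map(int, data[1:3 * n + 1]))
    m, p, c = scores[0::3], scores[1::3], scores[2::3]
    # output order per the spec: M-P, P-C, C-M
    for pair in ((m, p), (p, c), (c, m)):
        print('%.2f' % pearson(*pair))

if __name__ == '__main__':
    main()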

#StackBounty: #python #keras #convnet #audio-recognition Training a CNN with limited weight sharing

Bounty: 50

I am currently working on speech recognition, in which I would like to try using a CNN instead of the normal feature extraction step.

I have been reading this paper, which proposes a method using CNNs. The input is a visual representation of the mel-log filter bank energies of audio files.

[Image: mel-log filter bank energies used as the CNN input]

And the output is the phoneme recognised for each frame section, i.e. a portion (a frames, b frequency_bands) of the image.

The network is a CNN, and they propose a different weight sharing scheme – limited weight sharing – since the patterns being sought do not occur equally everywhere on the image, but are localised to certain frequency areas.

Using separate sets of weights for different frequency bands may be more suitable since it allows for detection of distinct feature patterns in different filter bands along the frequency axis. Fig. 5 shows an example of the limited weight sharing (LWS) scheme for CNNs, where only the convolution units that are attached to the same pooling unit share the same convolution weights. These convolution units need to share their weights so that they compute comparable features, which may then be pooled together.

I am not sure I understand the concept of this weight sharing.

Should the weights be shared for each frame but limited in frequency range?

Or should they be limited in both frame and frequency range?

They made an illustration of this weight sharing:

[Image: illustration of the limited weight sharing (LWS) scheme from the paper]

From what I can decipher from the image, limited weight sharing is option 2.

Each frame does not have the same weights; multiple convolutions are applied to the same frame, and the convolution for the next frame starts at a lower frequency than the previous frame, with stride = 2. So somehow the convolution is only performed on the diagonal of the image… which sounds weird?

It sounds like I've misinterpreted something here?
Any ideas on how to implement it?
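For what it's worth, here is a minimal Keras sketch of one reading of LWS: each frequency region gets its own Conv2D (no weights shared across regions), and only the units within a region feed the same pooling. All sizes here (40 mel bins × 100 frames, 4 regions, 48 output classes) are made-up illustration values, not numbers from the paper:

from tensorflow.keras import Input, Model, layers

n_mels, n_frames = 40, 100          # assumed input size (mel bins x frames)
n_bands, band_size = 4, 10          # split the frequency axis into 4 regions

inputs = Input(shape=(n_mels, n_frames, 1))
band_outputs = []
for b in range(n_bands):
    # Crop out one frequency region; the Conv2D below has its OWN weights,
    # i.e. nothing is shared across regions -- the core of LWS.
    top = b * band_size
    bottom = n_mels - (b + 1) * band_size
    band = layers.Cropping2D(cropping=((top, bottom), (0, 0)))(inputs)
    conv = layers.Conv2D(32, kernel_size=(band_size, 8), activation='relu')(band)
    # Pool only within the region, so the pooled units share their
    # convolution weights, as the quoted passage describes.
    band_outputs.append(layers.MaxPooling2D(pool_size=(1, 3))(conv))

x = layers.Concatenate(axis=1)(band_outputs)
x = layers.Flatten()(x)
outputs = layers.Dense(48, activation='softmax')(x)   # e.g. phoneme classes
model = Model(inputs, outputs)
model.summary()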


Get this bounty!!!


#StackBounty: #python #asynchronous #tensorflow #neural-network #distributed How does asynchronous training work in distributed Tensorf…

Bounty: 50

I've read the Distributed TensorFlow doc, and it mentions that in asynchronous training,

each replica of the graph has an independent training loop that executes without coordination.

From what I understand, if we use a parameter-server architecture with data parallelism, it means each worker computes gradients and updates its own weights without caring about other workers' updates when training the distributed neural network. As all weights are shared on the parameter server (ps), I think the ps still has to coordinate (or aggregate) weight updates from all workers in some way. I wonder how the aggregation works in asynchronous training. Or, in more general terms, how does asynchronous training work in distributed TensorFlow?
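To make the question concrete, here is a toy single-process sketch of the scheme as I understand it – plain NumPy with threads standing in for the workers and a shared array standing in for the ps. This is illustrative only, not the distributed TensorFlow API:

import threading
import numpy as np

# The shared array plays the role of the parameter server. Each worker
# thread reads a (possibly stale) snapshot, computes its own gradient on
# its own data, and applies the update immediately -- no coordination or
# averaging with the other workers.
w = np.zeros(3)
lock = threading.Lock()   # only to keep the in-place update atomic

def worker(X, y, steps=100, lr=0.01):
    global w
    for _ in range(steps):
        w_local = w.copy()                            # stale read is fine
        grad = 2 * X.T @ (X @ w_local - y) / len(y)   # gradient of MSE
        with lock:
            w -= lr * grad                            # applied as soon as ready

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
threads = [threading.Thread(target=worker, args=(X, y)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(w)   # approaches [1.0, -2.0, 0.5] despite the uncoordinated updates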


Get this bounty!!!

#StackBounty: #python #scipy #sympy Fitting multiple piecewise functions to data and return functions and derivatives as Fortran code

Bounty: 50

Background

For a future workshop I’ll have to fit arbitrary functions (independent variable is height z) to data from multiple sources (output of different numerical weather prediction models) in a yet unknown format (but basically gridded height/value pairs). The functions only have to interpolate the data and be differentiable. There should explicitly be no theoretical background for the type of function, but they should be smooth. The goal is to use the gridded (meaning discrete) output of the numerical weather prediction model in our pollutant dispersion model, which requires continuous functions.

Workflow

  1. choose the input model
  2. load input data
  3. define list of variables (not necessarily always the same)
  4. define height ranges (for the piecewise function)
  5. define base functions like “a0 + a1*z” for each height range and variable
  6. optionally define weights, because some parts are more important than others
  7. fit the piecewise functions
  8. save the fitted functions and their derivatives as Fortran 90 free form source code (to be included in our model)

I don’t think 1.-6. can be automated, but the rest should be.

Code

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from sympy import log, ln, Piecewise, lambdify, symbols, sympify, fcode, sqrt


def config(name):
    """Configuration of the piecewise function fitting, dependent on input name

    Input:
    name... name of experiment to fit data to, basically chooses settings
    Output:
    var_list... list of variables to fit
    infunc_list_dict... dictionary with var_list as keys, each having a list as
        value that contains strings with the sub-function to fit, from
        the bottom up. Only the first (lowest) may have a constant value, all
        others must be 0 at the height they "take over" (where their argument
        is 0). There, the value of the lower, fitted function is added to
        ensure continuity. The parameters for each function HAVE to be of the
        pattern "aX", where "X" is numerically increasing (0, 1, 2...) within
        each sub-function.
        The arguments of aloft functions (not the bottom most) are usually
        "z - t", unless there is some trickery with "s"
        A constant, first sub-function is 'a0', while constant sub-function
        aloft has to be '0' for technical reasons.
        Variables replaced by values:
            - t... current threshold height
            - s... transition value at height t
            - zi.. boundary layer height
    thresh_list_dict... dictionary with var_list as keys, each having a list as
        value that contains the height where the piecewise functions change.
        for technical reasons the ground (0) and the top (np.inf) are also
        included.
    weight_list_dict... dictionary with var_list as keys, each having a list as
        value that contains relative weights (to 1) that are used to force the
        fitting to be closer to the real value at crucial points. This is
        around the threshold heights, at the ground and at the ABL. To "turn
        off" a weight, set it to 1. The first weight is at the ground and then
        there are two around each threshold height and the last at the top.
        i.e: [ground,
            lower-of-thresh0, upper-of-thresh0,
            lower-of-thresh1, upper-of-thresh1,
            ...
            top]
        the first function uses ground and lower-of-thresh0,
        the second uses upper-of-thresh0 and  lower-of-thresh1 until
        the last uses lower-of-threshI and top
    wefact_list_dict... analog to weight_list_dict, except that it contains
        the relative distance where the weight in weight_list_dict is applied.
        Relative distance means here: fraction of the total subrange. Typical
        values are 0.1 or 0.2, meaning 10 or 20% of the total subrange take the
        accompanying weight. If the corresponding weight equals 1, the value
        has no influence.
    teston... True: create plots; False: don't
    saveon... True: don't show plots, save them as pdfs (only if teston==True).
    printon... True: print output to console; False: don't
    """
    teston = True
    saveon = False
    printon = False

    # ========= TMP220 =========
    if name == 'tmp220':
        abl_height = 990
        var_list = ['um', 'u2', 'v2', 'w2', 'w3', 'uw', 'eps']
        infunc_list_dict = {
            'um': ['a0*ln(z-t)**3 + a1*ln(z-t)**2 + a2*ln(z-t) + a3'],
            'u2': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2'],
            'v2': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2'],
            'w2': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2'],
            'w3': ['a0',
                '0'],
            'uw': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                'a0*(z-t) + a1*(z-t)**2 + a2*(z-t)**3 + a3*(z-t)**4'],
            'eps': ['a0 + a1*(z-t) + a2*(z-t)**2 + a3*(z-t)**3 + a4*(z-t)**4 + a5*(z-t)**5',
                    'a0*(z-t)**a1 + a2*(z-t)**3 + a3*(z-t)**2 + a4*(z-t)**4 + a5*(z-t)**6'],
            }
        thresh_list_dict = {
            'um': [0.0, np.inf],
            'u2': [0.0, 12.5, np.inf],
            'v2': [0.0, 12.5, np.inf],
            'w2': [0.0, 12.5, np.inf],
            'w3': [0.0, 12.5, np.inf],
            'uw': [0.0, 12.5, np.inf],
            'eps': [0.0, 12.5, np.inf],
            }
        weight_list_dict = {
            'um': [100, 1],
            'u2': [100, 5000, 1, 1],
            'v2': [100, 5000, 1, 1],
            'w2': [100, 5000, 1, 1],
            'w3': [100, 5000, 1, 1],
            'uw': [100, 5000, 1, 1],
            'eps': [100, 5000, 1, 1],
            }
        wefact_list_dict = {
            'um': [0.2, 0.1],
            'u2': [0.2, 0.2, 0.1, 0.1],
            'v2': [0.2, 0.2, 0.1, 0.1],
            'w2': [0.2, 0.2, 0.1, 0.1],
            'w3': [0.2, 0.2, 0.1, 0.1],
            'uw': [0.2, 0.2, 0.1, 0.1],
            'eps': [0.2, 0.2, 0.1, 0.1],
            }
    #elif name == 'SOMETHING ELSE': analog to above, omitted for brevity
    else:
        raise ValueError('Unsupported name, configure in config()')

    return (var_list, abl_height, infunc_list_dict, thresh_list_dict,
            weight_list_dict, wefact_list_dict, teston, saveon, printon)


def read_scm_data(name_str):
    """This routines reads in the profiles from the SCMs

    Input: # TODO (depends on their format), for now dummy data
    Output: dataframe: z, u2, v2, w2, w3, uw, um, eps
    """
    # TODO: add actual read routine, this is just dummy input
    if name_str == 'tmp220':
        out = pd.read_csv('tmp220.csv', delimiter=',')
    #elif name_str == 'SOMETHING ELSE': as above, omitted for brevity
    else:
        raise ValueError('Unknown name, configure in read_scm_data()')
    return out


def test_fit(name, var_list, func_dict, data, saveon):
    """plot of data vs fitted functions
    """
    # Omitted for brevity, not that relevant


def fit_func(var, abl_height, data_z, data_v, infunc_str_list,
            thresh_list, weight_list, wefact_list):
    """Converts the piecewise defined functions in infunc_str_list with the
    thresholds in thresh_list (and the weights defined by weight_list and
    wefact_list) to a SymPy expression and fits it to (data_z, data_v), where
    data_z is height and data_v are the values in each height. Returns the
    piecewise SymPy function with substituted parameters.
    """
    z = symbols('z')
    y_list = []  # holds the subfunctions
    niterations = 20000
    # transition_value holds the value that is added to each sub-function
    # to ensure a continuous function. this is obviously 0 for the first
    # subfunction and equal to the value of the previous sub-function at the
    # threshold height for each subsequent sub-function.
    transition_value = 0

    # for each piece of the function:
    for i, func_str in enumerate(infunc_str_list):
        # find number of parameters and create those SymPy objects
        nparams = func_str.count('a')
        a = symbols('a0:%d' % nparams)
        t = symbols('t')  # transition height
        s = symbols('s')  # transition value
        zi = symbols('zi')  # boundary layer height

        # check the string and create the sympy expression
        verify_func_str(var, func_str)
        y_list.append(sympify(func_str))

        # add the transition value and substitute the placeholder variables:
        y_list[i] += transition_value
        y_list[i] = y_list[i].subs(t, thresh_list[i])
        y_list[i] = y_list[i].subs(s, transition_value)
        y_list[i] = y_list[i].subs(zi, abl_height)

        # lambdify the sympy-expression with a somewhat ugly hack:
        t = [z]
        for j in range(nparams):
            t.append(a[j])
        func = lambdify(tuple(t), y_list[i], modules=np)

        # create the correct subset of the data
        local_index = (data_z > thresh_list[i]) & (data_z < thresh_list[i + 1])
        local_z = data_z[local_index]
        local_v = data_v[local_index]

        # create the weight arrays. they have the same size as the local_z and
        # are 1 everywhere except the range defined with wefact, where they
        # are the specified weight. see config() for definitions.
        weight = np.ones_like(local_z)
        z_range = local_z[-1] - local_z[0]
        lower_weight_lim = local_z[0] + wefact_list[2*i] * z_range
        upper_weight_lim = local_z[-1] - wefact_list[2*i + 1] * z_range
        weight[local_z < lower_weight_lim] = weight_list[2*i]
        weight[local_z > upper_weight_lim] = weight_list[2*i + 1]
        sigma = 1. / weight

        # fit the function to the data, checking for constant function aloft:
        if nparams > 0:
            popt, pcov = curve_fit(func, local_z, local_v, sigma=sigma,
                                maxfev=niterations)

        # substitute fitted parameters in sympy expression:
        for j in range(nparams):
            y_list[i] = y_list[i].subs(a[j], popt[j])

        # calculate the new transition_value:
        if nparams > 0:
            transition_value = func(thresh_list[i + 1], *popt)
        else:
            transition_value = func(thresh_list[i + 1])

    # After all sub-functions are fitted, combine them to a piecewise function.
    # This is a terrible hack, but I couldn't find out how to create piecewise
    # functions dynamically...
    if len(y_list) == 1:
        y = y_list[0]
    elif len(y_list) == 2:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], True))
    elif len(y_list) == 3:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], z <= thresh_list[2]),
                    (y_list[2], True))
    elif len(y_list) == 4:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], z <= thresh_list[2]),
                    (y_list[2], z <= thresh_list[3]),
                    (y_list[3], True))
    elif len(y_list) == 5:
        y = Piecewise((y_list[0], z <= thresh_list[1]),
                    (y_list[1], z <= thresh_list[2]),
                    (y_list[2], z <= thresh_list[3]),
                    (y_list[3], z <= thresh_list[4]),
                    (y_list[4], True))
    else:
        raise ValueError('More than five sub-functions not implemented yet')
    return y


def create_deriv(funcname, func):
    """Creates the derivative of the function, taking into account that v2 has
    two "derivatives".
    careful: returns tuple of two functions if funcname==v2, else one function
    """
    z = symbols('z')
    if funcname != 'v2':
        return func.diff(z)
    else:
        deriv = func.diff(z)
        deriv_sig = sqrt(func).diff(z)
        return (deriv, deriv_sig)


def verify_input(name, var, infunc_list, thresh_list,
                weight_list, wefact_list):
    """rudimentary checks if the functions, weights and thresholds are faulty
    """
    nfuncs = len(infunc_list)
    if len(thresh_list) != nfuncs + 1:
        raise ValueError('Number of functions and thresholds disagree for ' +
                        var + ' of ' + name)
    if len(weight_list) != nfuncs * 2:
        raise ValueError('Number of functions and weights disagree for ' +
                        var + ' of ' + name)
    if len(wefact_list) != nfuncs * 2:
        raise ValueError('Number of functions and weight factors disagree' +
                        ' for ' + var + ' of ' + name)


def verify_func_str(var, func_str):
    """Checks if the function string has linearly increasing parameters,
    starting with 0 (i.e. "a0, a1, a2..."), because otherwise there is only
    a cryptic error in minpack.
    """
    index_list = []
    for c, char in enumerate(func_str):
        if char == 'a':
            index_list.append(int(func_str[c+1]))
    if list(range(0, len(index_list))) != index_list:
        raise ValueError(func_str + ' has non-monotonically increasing' +
                        'parameter indices or does not start with a0' +
                        ' in variable ' + var)


def main(name, var_list, abl_height, infunc_list_dict, thresh_list_dict,
        weight_list_dict, wefact_list_dict, teston, saveon, printon):
    """Start routines, print output (if printon), save Fortran functions in a
    file and start testing (if teston)
    """
    func_dict, deri_dict = {}, {}
    # file_str stores everything that is written to file in one string
    file_str = '!' + 78*'=' + '\n' + '!     ' + name + '\n!' + 78*'=' + '\n'
    if printon:
        print(' ' + 78*'_' + ' ')
        print('|' + 78*' ' + '|')
        print('|' + 30*' ' + name + (48-len(name))*' ' + '|')
        print('|' + 78*' ' + '|')
    data = read_scm_data(name)
    for var in var_list:
        verify_input(name, var, infunc_list_dict[var], thresh_list_dict[var],
                    weight_list_dict[var], wefact_list_dict[var])
        if printon:
            print('! ----- ' + var)
        file_str += '! ----- ' + var + '\n'
        # use data.z.values to get rid of the pandas-overhead and because
        # some stuff is apparently not possible otherwise (like data.z[-1])
        func_dict[var] = fit_func(var, abl_height, data.z.values,
                                data[var].values, infunc_list_dict[var],
                                thresh_list_dict[var], weight_list_dict[var],
                                wefact_list_dict[var])
        func_fstr = fcode(func_dict[var], source_format='free', assign_to=var,
                        standard=95)
        if printon:
            print(func_fstr)
        file_str += func_fstr + '\n'
        if var != 'v2':
            deri_dict[var] = create_deriv(var, func_dict[var])
            deri_fstr = fcode(deri_dict[var], source_format='free',
                            assign_to='d'+var)
            if printon:
                print(deri_fstr)
            file_str += deri_fstr + '\n\n'
        else:
            deri_dict[var], deri_dict['sigv'] = create_deriv(var,
                                                            func_dict[var])
            deri_fstr = fcode(deri_dict[var], source_format='free',
                            assign_to='d'+var, standard=95)
            deri2_fstr = fcode(deri_dict['sigv'], source_format='free',
                            assign_to='dsigv', standard=95)
            file_str += deri_fstr + '\n'
            file_str += deri2_fstr + '\n\n'
            if printon:
                print(deri_fstr)
                print(deri2_fstr)
        if printon:
            print('')
    if printon:
        print('|' + 78*'_' + '|\n')
    file_str = file_str + '\n\n'  # end with newlines
    if teston:
        test_fit(name, var_list, func_dict, data, saveon)

    # save fortran functions in file:
    filename = name + '_turbparas.inc'
    with open(filename, 'w') as f:
        f.write(file_str)


if __name__ == '__main__':
    name = 'tmpBUB'
    main(name, *config(name))

The code is currently not runnable, because the input file is missing. I couldn't find a canonical way to upload data, please advise. It's currently a 160 kB .csv file.

The code runs and does what I want, but I don’t have any formal training in programming and I’m sure it could be improved. Speed is not a huge issue (really depends on how complicated the functions to be fit are), but reliability and adaptability are.

Some points I know are “wrong”:

  • too many comments that explain what the code does (I like those, because I forget stuff)
  • the input strings for the base sub-functions are too long (>80) and lack spaces around *. It’s a compromise.
  • if a height range has fewer data points than the corresponding sub-function has parameters, the code halts with a helpful minpack error message.

Some details I’d like to be different but also know to be impossible without changing sympy:

  • Fortran output with 4 instead of 3 spaces
  • Fortran output with 132 line length instead of 80.
  • A dynamic way to combine the pieces into one piecewise function, to avoid those if conditions at the end of fit_func (maybe that is possible? see the sketch below).
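Regarding the last point, a dynamic construction does seem possible; a minimal sketch, using the same y_list/thresh_list conventions as fit_func:

from sympy import Piecewise, symbols

z = symbols('z')

def combine_piecewise(y_list, thresh_list):
    # Pair every sub-function except the last with its upper threshold
    # (thresh_list[0] is the ground and thresh_list[-1] is np.inf, so the
    # inner thresholds are thresh_list[1:-1]); the last piece is the
    # unconditional fallback, exactly as in the if-chain in fit_func.
    if len(y_list) == 1:
        return y_list[0]
    pieces = [(f, z <= t) for f, t in zip(y_list[:-1], thresh_list[1:-1])]
    pieces.append((y_list[-1], True))
    return Piecewise(*pieces)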


Get this bounty!!!

#StackBounty: #python #python-3.x #asynchronous #async-await Checking HTTP headers with asyncio and aiohttp

Bounty: 50

This is one of my first attempts to do something practical with asyncio. The task is simple:

Given a list of URLs, determine if the content type is HTML for every URL.

I’ve used aiohttp, initializing a single “session”, ignoring SSL errors and issuing HEAD requests to avoid downloading the whole endpoint body. Then, I simply check if text/html is inside the Content-Type header string:

import asyncio

import aiohttp


@asyncio.coroutine
def is_html(session, url):
    response = yield from session.head(url, compress=True)
    print(url, "text/html" in response.headers["Content-Type"])


if __name__ == '__main__':
    links = ["https://httpbin.org/html",
             "https://httpbin.org/image/png",
             "https://httpbin.org/image/svg",
             "https://httpbin.org/image"]
    loop = asyncio.get_event_loop()

    conn = aiohttp.TCPConnector(verify_ssl=False)
    with aiohttp.ClientSession(connector=conn, loop=loop) as session:
        f = asyncio.wait([is_html(session, link) for link in links])
        loop.run_until_complete(f)

The code works; it prints (the output order is inconsistent, of course):

https://httpbin.org/image/svg False
https://httpbin.org/image False
https://httpbin.org/image/png False
https://httpbin.org/html True

But, I’m not sure if I’m using asyncio loop, wait and coroutines, aiohttp‘s connection and session objects appropriately. What would you recommend to improve?
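For comparison, a minimal variation I could also use: return the result from the coroutine and collect everything with asyncio.gather, which (unlike wait) preserves the order of the inputs:

import asyncio

import aiohttp


@asyncio.coroutine
def is_html(session, url):
    response = yield from session.head(url, compress=True)
    # return the result instead of printing inside the coroutine
    return url, "text/html" in response.headers["Content-Type"]


if __name__ == '__main__':
    links = ["https://httpbin.org/html",
             "https://httpbin.org/image/png"]
    loop = asyncio.get_event_loop()

    conn = aiohttp.TCPConnector(verify_ssl=False)
    with aiohttp.ClientSession(connector=conn, loop=loop) as session:
        results = loop.run_until_complete(
            asyncio.gather(*[is_html(session, link) for link in links]))
    for url, ok in results:   # gather keeps the input order
        print(url, ok)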


Get this bounty!!!

#StackBounty: #python #selenium #twitter #web-scraping #instagram Web Scraping with Selenium Python [Twitter + Instagram]

Bounty: 50

I am trying to web scrape both Instagram and Twitter based on geolocation.
I can run a query search, but I am having challenges in reloading the web page to load more results and storing the fields in a data frame.

I did find a couple of examples for web scraping Twitter and Instagram without API keys, but they work with #tag keywords.

I am trying to scrape with respect to geolocation and between old dates. So far I have gotten this far, writing code in Python 3.x with the latest versions of the packages in Anaconda.

'''
    Instagram - Components
    "id": "1478232643287060472", 
     "dimensions": {"height": 1080, "width": 1080}, 
     "owner": {"id": "351633262"}, 
     "thumbnail_src": "https://instagram.fdel1-1.fna.fbcdn.net/t51.2885-15/s640x640/sh0.08/e35/17439262_973184322815940_668652714938335232_n.jpg", 
     "is_video": false, 
     "code": "BSDvMHOgw_4", 
     "date": 1490439084, 
     "taken-at=213385402"
     "display_src": "https://instagram.fdel1-1.fna.fbcdn.net/t51.2885-15/e35/17439262_973184322815940_668652714938335232_n.jpg", 
     "caption": "Hakuna jambo zuri kama kumpa Mungu shukrani kwa kila jambo.. ud83dude4fud83cudffenIts weekendn#lifeistooshorttobeunhappyn#Godisgood n#happysoul ud83dude00", 
     "comments": {"count": 42}, 
     "likes": {"count": 3813}}, 
'''


import selenium
from selenium import webdriver
#from selenium import selenium
from bs4 import BeautifulSoup
import pandas

#geotags = pd.read_csv("geocodes.csv")
#parmalink = 
query = geocode%3A35.68501%2C139.7514%2C30km%20since:2016-03-01%20until:2016-03-02&f=tweets

twitterURL = 'https://twitter.com/search?q=' + query
#instaURL = "https://www.instagram.com/explore/locations/213385402/"


browser = webdriver.Firefox()
browser.get(twitterURL)
content = browser.page_source

soup = BeautifulSoup(content)
print (soup)

For the Twitter search query I am getting a syntax error.
For Instagram I am not getting any error, but I am not able to reload for more posts and write back to a CSV dataframe.
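For reference, the Twitter syntax error comes from the unquoted query literal in the code above; a minimal sketch of building it as a URL-encoded string (same coordinates and dates):

from urllib.parse import quote

# quote() percent-encodes the colons, commas and spaces, matching the
# %3A/%2C/%20 escapes in the original literal
query = quote("geocode:35.68501,139.7514,30km since:2016-03-01 until:2016-03-02")
twitterURL = "https://twitter.com/search?q=" + query + "&f=tweets"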

I have a list of geo coordinates in a CSV; I can use that as input or can write a query for the search.

Any way to complete the scraping by location would be appreciated.

Appreciate the help !!


Get this bounty!!!

#StackBounty: #python #elementtree #xml.etree XML ElementTree – indexing tags

Bounty: 50

I have an XML file:

<sentence id="en_BlueRibbonSushi_478218345:2">
   <text>It has great sushi and even better service.</text>
</sentence>
<sentence id="en_BlueRibbonSushi_478218345:3">
   <text>The entire staff was extremely accomodating and tended to my every need.</text>
</sentence>
<sentence id="en_BlueRibbonSushi_478218345:4">
   <text>I&apos;ve been to this restaurant over a dozen times with no complaints to date.</text>
</sentence>

Using XML ElementTree, I would like to insert a tag <Opinion> that has an attribute category=. Say I have a list of chars list = ['a', 'b', 'c'], is it possible to incrementally assign them to each text so that I have:

<sentence id="en_BlueRibbonSushi_478218345:2">
   <text>It has great sushi and even better service.</text>
   <Opinion category='a' />
</sentence>
<sentence id="en_BlueRibbonSushi_478218345:3">
   <text>The entire staff was extremely accomodating and tended to my every need.</text>
   <Opinion category='b' />
</sentence>
<sentence id="en_BlueRibbonSushi_478218345:4">
   <text>I&apos;ve been to this restaurant over a dozen times with no complaints to date.</text>
   <Opinion category='c' />
</sentence>

I am aware I can use the sentence id attribute but this would require a lot of restructuring of my code. Basically, I’d like to be able to index each sentence entry to align with my list index.
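For what it's worth, a minimal sketch of a zip-based approach (assuming the <sentence> elements sit under a single root element, which ElementTree requires, and a hypothetical filename):

import xml.etree.ElementTree as ET

categories = ['a', 'b', 'c']
tree = ET.parse('reviews.xml')    # hypothetical filename
root = tree.getroot()

# Iterate sentences in document order and pair each with the next category,
# so the list index aligns with the sentence position.
for sentence, category in zip(root.iter('sentence'), categories):
    ET.SubElement(sentence, 'Opinion', category=category)

tree.write('reviews_out.xml')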


Get this bounty!!!

#StackBounty: #python #performance #benchmarking #pypy Accurately testing Pypy vs CPython performance

Bounty: 50

The Problem Description:

I have this custom “checksum” function:

NORMALIZER = 0x10000


def get_checksum(part1, part2, salt="trailing"):
    """Returns a checksum of two strings."""

    combined_string = part1 + part2 + " " + salt if part2 != "***" else part1
    ords = [ord(x) for x in combined_string]

    checksum = ords[0]  # initial value

    # TODO: document the logic behind the checksum calculations
    iterator = zip(ords[1:], ords)
    checksum += sum(x + 2 * y if counter % 2 else x * y
                    for counter, (x, y) in enumerate(iterator))
    checksum %= NORMALIZER

    return checksum

Which I want to test on both Python 3.6 and PyPy, performance-wise. I'd like to see if the function would perform better on PyPy, but I'm not completely sure what the most reliable and clean way to do that is.

What I’ve tried and the Question:

Currently, I’m using timeit for both:

$ python3.6 -mtimeit -s "from test import get_checksum" "get_checksum('test1' * 100000, 'test2' * 100000)"
10 loops, best of 3: 329 msec per loop

$ pypy -mtimeit -s "from test import get_checksum" "get_checksum('test1' * 100000, 'test2' * 100000)"
10 loops, best of 3: 104 msec per loop

My concern is I’m not absolutely sure if timeit is the right tool for the job on PyPy because of the potential JIT warmup overhead.

Plus, PyPy itself reports the following before the test results:

WARNING: timeit is a very unreliable tool. use perf or something else for real measurements
pypy -m pip install perf
pypy -m perf timeit -s 'from test import get_checksum' "get_checksum('test1' * 1000000, 'test2' * 1000000)"

What would be the best and most accurate approach to test the same exact function performance across these and potentially other Python implementations?
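One crude, self-contained check of the JIT-warmup concern (assuming the function above lives in test.py): time successive calls and see whether the per-call time drops as PyPy compiles the hot code:

import time

from test import get_checksum   # the module shown above

a, b = 'test1' * 100000, 'test2' * 100000
for i in range(10):
    t0 = time.perf_counter()
    get_checksum(a, b)
    # on PyPy the later iterations should be noticeably faster than the first
    print(i, time.perf_counter() - t0)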


Get this bounty!!!

#StackBounty: #python #numpy #matplotlib #fractals Plotting the Mandelbrot set at different zoom levels

Bounty: 50

I’m interested in making an animated movie of a zoom in on a part of the Mandelbrot set. My code works well for a few zooms, but upon trying to zoom in quite far, I find that the fractal becomes “smoothed out”. Am I missing something that prevents me from seeing the fractal structure at higher zoom levels? Am I hitting machine precision in the computations? Here’s what I’m talking about:

Zoom level 3: [image]
Zoom level 9: [image]
Zoom level 20: [image]

The first plot looks good, the second is okay, and the third is not fractal at all. (For scale: at zoom level j the pixel spacing is 2h/N = 2^(1−j)/500, about 4×10⁻⁹ at j = 20, which is still far above the double-precision epsilon of about 2.2×10⁻¹⁶.)

If there are any other deficiencies or improvements I’d be glad to hear about them as well.

Here’s my code:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib import animation
import time

# some interesting places in the set
# http://www.nahee.com/Derbyshire/manguide.html

N = 500
nIts = 25
nZooms = 50
x0=0
y0=-1

movie = np.zeros([N,N,nZooms])

x = np.linspace(-2,1,N)
y = np.linspace(-1,1,N)
X,Y = np.meshgrid(x,y)
c = X + 1j*Y
z = 0*c
for i in range(nIts):
    z = z**2 + c

mask = np.abs(z) < 1
z[z>1]=0
z[np.isnan(z)]=0
movie[:,:,0] = mask


# plotting stuff
for j in range(1,nZooms):
    h=1./(2**j)
    print "Plot number ", j
    x = x0+h*np.linspace(-1,1,N)
    y = y0+h*np.linspace(-1,1,N)
    X,Y = np.meshgrid(x,y)
    c = X + 1j*Y
    z = 0*c
    for i in range(nIts):
        z = z**2 + c

    mask = np.abs(z) < 1
    z[z>1]=0
    z[np.isnan(z)]=0
    movie[:,:,j] = mask

fig = plt.figure()

for j in range(nZooms):
    name = "image%d.png" % j
    plt.imshow(movie[:,:,j], cmap = 'RdBu')
    plt.gray()
    plt.axis('equal')
    plt.axis('off')
    # plt.show()
    fig.savefig(name)
    time.sleep(1)


Get this bounty!!!