#StackBounty: #python #mongodb running mongo queries against data in memory

Bounty: 200

I have a mongodb collection against which I need to run many count operations (each with a different query) every hour. When I first set this up, the collection was small, and these count operations ran in approx one minute, which was acceptable. Now they take approx 55 minutes, so they’re running nearly continuously.

The query associated with each count operation is rather involved, and I don’t think there’s a way to get them all to run with indices (i.e. as COUNT_SCAN operations).

The only feasible solution I’ve come up with is to:

  • Run a full collection scan every hour, pulling every document out of the db
  • Once each document is in memory, run all of the count operations against it myself

Without my solution the server is running dozens and dozens of full collection scans each hour. With my solution the server is only running one. This has led me to a strange place where I need to take my complex queries and re-implement them myself so I can come up with my own counts every hour.

So my question is whether there’s any support from mongo drivers (pymongo in my case, but I’m curious in general) for interpreting query documents but running them locally against data in memory, not against data on the mongodb server.

Initially this felt like an odd request, but there are actually quite a few places where this approach would probably greatly lessen the load on the database in my particular use case. So I wonder if it comes up from time to time in other production deployments.
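To illustrate what I mean, there are third-party packages (e.g. mongoquery on PyPI) that interpret Mongo-style query documents in pure Python. A hand-rolled matcher for a tiny subset of the query operators (the function and variable names here are mine, not part of any driver) might look like:

```python
def matches(doc, query):
    """Return True if `doc` satisfies a (tiny subset of a) Mongo query document."""
    ops = {
        "$gt": lambda a, b: a > b,
        "$gte": lambda a, b: a >= b,
        "$lt": lambda a, b: a < b,
        "$lte": lambda a, b: a <= b,
        "$ne": lambda a, b: a != b,
        "$in": lambda a, b: a in b,
    }
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            # Operator form, e.g. {"size": {"$lt": 10}}
            for op, operand in cond.items():
                if not ops[op](value, operand):
                    return False
        elif value != cond:
            # Plain equality form, e.g. {"status": "open"}
            return False
    return True

docs = [{"status": "open", "size": 5}, {"status": "closed", "size": 12}]
count = sum(matches(d, {"status": "open", "size": {"$lt": 10}}) for d in docs)
```

Something like this works for simple equality and comparison operators, but re-implementing $elemMatch, $regex, and friends myself is exactly the duplication I’d like to avoid.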

Get this bounty!!!

#StackBounty: #python #random-forest #natural-language How to train a model to recognize the pronunciation of a syllabe

Bounty: 50

I would like to train a model able to predict the pronunciation of the syllables of French words.

So, I created a set of syllabized words, and for each syllable I have its pronunciation code. Example: the word parkinson is syllabized like this: par-kin-son, and its pronunciation code is paR-kin-sOn. I chose an English word for clarity, but in reality my training set consists of French words.

First, I’ve created a table of symbols for syllables:

symbols_input = [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
                 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w',
                 'x', 'y', 'z', 'à', 'â', 'ç', 'è', 'é', 'ê', 'î', 'ï', 'ô',
                 'û', 'ü']

And for pronunciation codes:

symbols_output = [' ', '1', '2', '5', '8', '9', '@', 'E', 'G', 'N', 'O', 'R',
                  'S', 'Z', 'a', 'b', 'd', 'e', 'f', 'g', 'i', 'j', 'k', 'l',
                  'm', 'n', 'o', 'p', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
                  '§', '°']

What I do after that is convert my syllables and pronunciation codes into vectors using the following function:

def convert_syllabe_2_vector(syllabe, table):
    """Helper function to convert a syllable into a vector of integer values."""
    return [table.index(letter) for letter in syllabe]


>>> convert_syllabe_2_vector('par', symbols_input)
[16, 1, 18]
>>> convert_syllabe_2_vector('paR', symbols_output)
[27, 14, 11]

Since the longest syllable in my set is 8 characters, and the longest pronunciation code is 6, the next thing I do is zero-pad my vectors like this:
16 01 18 00 00 00 00 00 and 27 14 11 00 00 00

Because, in French, the previous and next syllables can influence the pronunciation, I form a bigger vector of 8 * 3 dimensions. So, for the syllable par, this gives me:

# empty                 par                    -kin
00 00 00 00 00 00 00 00 16 01 18 00 00 00 00 00 11 09 14 00 00 00 00 00 

and the correct prediction paR:

27 14 11 00 00 00
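Putting the padding and context window together (the helper name is mine; the values are from the par example above):

```python
def pad(vec, length):
    """Right-pad an integer vector with zeros up to a fixed length."""
    return vec + [0] * (length - len(vec))

# previous syllable (empty), current syllable 'par', next syllable 'kin'
prev, cur, nxt = [], [16, 1, 18], [11, 9, 14]

# Concatenate the three padded vectors into one 8 * 3 = 24-dimensional input
window = pad(prev, 8) + pad(cur, 8) + pad(nxt, 8)
```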

Do you know a better approach? Is it right to say it’s a classification problem and not a regression one? What kind of model could fit properly? I tried a Random Forest Classifier; it seems to work, but I suspect there is a lot of overfitting, and training uses a lot of RAM. I also tried to normalize my vectors by dividing them by 38 (the number of symbols in each table) and then multiplying the predicted output by 38 again, but the results I obtained were terrible.


#StackBounty: #java #python #pbkdf2 #hmacsha1 #hashlib Python equivalent of java PBKDF2WithHmacSHA1

Bounty: 50

I’m tasked with building a consumer of an API that requires an encrypted token with a seed value that is the UNIX time. The example I was shown was implemented in Java, which I’m unfamiliar with, and after reading through documentation and other Stack Overflow posts I have been unable to find a solution.

Using the javax.crypto.SecretKey, javax.crypto.SecretKeyFactory, javax.crypto.spec.PBEKeySpec, and javax.crypto.spec.SecretKeySpec classes, I need to generate a token similar to the one below:

public class EncryptionTokenDemo {

    public static void main(String[] args) {
        long millis = System.currentTimeMillis();
        String time = String.valueOf(millis);
        String secretKey = "somekeyvalue";
        int iterations = 12345;
        String strToEncrypt_acctnum = "somevalue|" + time + "|" + iterations;

        try {
            byte[] input = strToEncrypt_acctnum.getBytes("utf-8");
            byte[] salt = secretKey.getBytes("utf-8");
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
            SecretKey tmp = factory.generateSecret(new PBEKeySpec(secretKey.toCharArray(), salt, iterations, 256));
            SecretKeySpec skc = new SecretKeySpec(tmp.getEncoded(), "AES");
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, skc);
            byte[] cipherText = new byte[cipher.getOutputSize(input.length)];
            int ctLength = cipher.update(input, 0, input.length, cipherText, 0);
            ctLength += cipher.doFinal(cipherText, ctLength);
            String query = Base64.encodeBase64URLSafeString(cipherText);
            System.out.println("The unix time in ms is :: " + time);
            System.out.println("Encrypted Token is :: " + query);
        } catch (Exception e) {
            System.out.println("Error while encrypting :" + e);
        }
    }
}
Should I be using the built-in library hashlib to implement something like this? I can’t really find documentation for implementing a PBKDF2 encryption with iterations/salt as inputs. Should I be using pbkdf2? Sorry for the vague questions, I’m unfamiliar with the encryption process and feel like even just knowing what the correct constructor would be is a step in the right direction.
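From what I can tell so far (please correct me if this is wrong), the key-derivation half of the Java code maps to the standard library’s hashlib.pbkdf2_hmac; the AES/ECB/PKCS5Padding step is not in the stdlib and would need a third-party library such as pycryptodome:

```python
import hashlib

secret_key = "somekeyvalue"   # same placeholder as in the Java example
iterations = 12345

# PBKDF2WithHmacSHA1 with a 256-bit (32-byte) derived key; note the Java
# code uses the secret key itself as the salt as well.
key = hashlib.pbkdf2_hmac("sha1",
                          secret_key.encode("utf-8"),   # password
                          secret_key.encode("utf-8"),   # salt
                          iterations,
                          dklen=32)
```

The resulting 32 bytes would then serve as the AES key for the encryption step.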


#StackBounty: #python #statistics Cross-validating Fixed-effect Model

Bounty: 50

Code Summary

The following code is a data science script I’ve been working on that cross-validates a fixed effect model. I’m moving from R to Python and would appreciate feedback on the code below.

The code does the following:

  1. Split data into train and test using a custom function that groups/clusters the data

  2. Estimate a linear fixed effect model with train and test data

  3. Calculate RMSE and tstat to verify independence of residuals

  4. Prints RMSE, SE, and tstat from cross-validation exercise.

Note: the code downloads a remote data set, so the code can be run on its own.


from urllib import request
from scipy import stats
import pandas as pd
import numpy as np
import statsmodels.api as sm

print("Defining functions......")

def main():
    """Estimate baseline and degree day regressions.

    Returns:
        data.frame with RMSE, SE, and tstats
    """
    # Download remote from github
    print("Downloading custom data set from: ")
    file_url = "https://github.com/johnwoodill/corn_yield_pred/raw/master/data/full_data.pickle"
    request.urlretrieve(file_url, "full_data.pickle")
    cropdat = pd.read_pickle("full_data.pickle")

    # Baseline WLS Regression Cross-Validation with FE and trends
    print("Estimating Baseline Regression")
    basedat = cropdat[['ln_corn_yield', 'trend', 'trend_sq', 'corn_acres']]
    fe_group = pd.get_dummies(cropdat.fips)
    regdat = pd.concat([basedat, fe_group], axis=1)
    base_rmse, base_se, base_tstat = felm_cv(regdat, cropdat['trend'])

    # Degree Day Regression Cross-Validation
    print("Estimating Degree Day Regression")
    dddat = cropdat[['ln_corn_yield', 'dday0_10C', 'dday10_30C', 'dday30C',
                     'prec', 'prec_sq', 'trend', 'trend_sq', 'corn_acres']]
    fe_group = pd.get_dummies(cropdat.fips)
    regdat = pd.concat([dddat, fe_group], axis=1)
    ddreg_rmse, ddreg_se, ddreg_tstat = felm_cv(regdat, cropdat['trend'])

    # Get results as data.frame
    fdat = {'Regression': ['Baseline', 'Degree Day',],
            'RMSE': [base_rmse, ddreg_rmse],
            'se': [base_se, ddreg_se],
            't-stat': [base_tstat, ddreg_tstat]}

    fdat = pd.DataFrame(fdat, columns=['Regression', 'RMSE', 'se', 't-stat'])

    # Calculate percentage change
    fdat['change'] = (fdat['RMSE'] - fdat['RMSE'].iloc[0])/fdat['RMSE'].iloc[0]
    return fdat

def felm_rmse(y_train, x_train, weights, y_test, x_test):
    """Estimate WLS from y_train, x_train, predict using x_test, calculate RMSE,
    and test whether residuals are independent.

    Args:
        y_train: Dep variable - full or training data
        x_train: Covariates - full or training data
        weights: Weights for WLS
        y_test:  Dep variable - test data
        x_test:  Covariates - test data

    Returns:
        Tuple with RMSE and tstat from ttest
    """
    # Fit model and get predicted values of test data
    mod = sm.WLS(y_train, x_train, weights=weights).fit()
    pred = mod.predict(x_test)

    #Get residuals from test data
    res = (y_test[:] - pred.values)

    # Calculate ttest to check residuals from test and train are independent
    t_stat = stats.ttest_ind(mod.resid, res, equal_var=False)[0]

    # Return RMSE and t-stat from ttest
    return (np.sqrt(np.mean(res**2)), t_stat)

def gc_kfold_cv(data, group, begin, end):
    """Custom group/cluster data split for cross-validation of panel data.
    (Ensures groups are clustered and that train and test residuals are independent.)

    Args:
        data:  data to filter with 'trend'
        group: group to cluster
        begin: start of cluster
        end:   end of cluster

    Returns:
        Test and train data for the group-by-cluster cross-validation method
    """
    # Get group data
    data = data.assign(group=group.values)

    # Filter test and train based on begin and end
    test = data[data['group'].isin(range(begin, end))]
    train = data[~data['group'].isin(range(begin, end))]

    # Return train and test as a dict
    return {1: train, 2: test}

def felm_cv(regdata, group):
    """Cross-validate WLS FE model.

    Args:
        regdata: regression data
        group:   group fixed effect

    Returns:
        Mean RMSE, standard error, and mean tstat from ttest
    """
    # Loop over a moving window of test groups, training on the rest
    retrmse = []
    rettstat = []
    for j in range(1, 28):
        # Get test and training data
        tset = gc_kfold_cv(regdata, group, j, j + 4)

        # Separate y_train, x_train, y_test, x_test, and weights
        y_train = tset[1].ln_corn_yield
        x_train = tset[1].drop(['ln_corn_yield', 'corn_acres'], axis=1)
        weights = tset[1].corn_acres
        y_test = tset[2].ln_corn_yield
        x_test = tset[2].drop(['ln_corn_yield', 'corn_acres'], axis=1)

        # Get RMSE and tstat from train and test data
        inrmse, t_stat = felm_rmse(y_train, x_train, weights, y_test, x_test)

        # Append RMSE and tstat to the return lists
        retrmse.append(inrmse)
        rettstat.append(t_stat)

    # Return mean RMSE, s.e., and mean tstat
    return (np.mean(retrmse), np.std(retrmse), np.mean(rettstat))

if __name__ == "__main__":
    RDAT = main()

    # print results
    print("Baseline: ", round(RDAT.iloc[0, 1], 2), "(RMSE)",
          round(RDAT.iloc[0, 2], 2), "(se)",
          round(RDAT.iloc[0, 3], 3), "(t-stat)")
    print("Degree Day: ", round(RDAT.iloc[1, 1], 2), "(RMSE)",
          round(RDAT.iloc[1, 2], 2), "(se)",
          round(RDAT.iloc[1, 3], 2), "(t-stat)")
    print("% Change from Baseline: ", round(RDAT.iloc[1, 4], 4)*100, "%")


#StackBounty: #python #inheritance #pickle #defaultdict Can't pickle recursive nested defaultdict

Bounty: 50

I have a recursive nested defaultdict class defined as

from collections import defaultdict

class NestedDict(defaultdict):
    def __init__(self):
        super().__init__(NestedDict)

sitting in a nested_dict.py file.

When I try to pickle it, e.g.

import pickle
from nested_dict import NestedDict

d = NestedDict()
pickle.loads(pickle.dumps(d))
I get TypeError: __init__() takes 1 positional argument but 2 were given.

What’s exactly happening here?
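Poking around, defaultdict’s pickle support seems relevant: its __reduce__ hands default_factory back to __init__ as a positional argument when the object is reconstructed, which would explain the extra argument in the error. This can be seen with:

```python
from collections import defaultdict

d = defaultdict(list)
# The second element of __reduce__'s result is the argument tuple that
# will be passed to __init__ during unpickling -- here it contains the
# factory, so reconstruction calls defaultdict(list).
cls, args = d.__reduce__()[:2]
```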


#StackBounty: #python #numba Create Array of Dates in Numba?

Bounty: 50

I would like to create an array of dates in a Numba function, running in nopython mode.

I can’t see a date type, so I am trying NPDatetime.

My attempted code is:

import numba as nb
import numpy as np

@nb.njit
def xxx():
    return np.empty(10, dtype=nb.types.NPDatetime('D'))

xxx()


However, the code returns this error:

Unknown attribute 'NPDatetime' of type Module(<module 'numba.types' from '/home/xxx/anaconda3/lib/python3.6/site-packages/numba/types/__init__.py'>)

My numba version is 0.39.0
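The workaround I’m considering (not verified against numba 0.39) is to do the work on int64 day offsets, which nopython mode handles, and convert to datetime64 outside the jitted function. In plain NumPy terms (the jit decorator is omitted here so the sketch stands alone):

```python
import numpy as np

def make_day_offsets(n):
    # This integer-only body is the part that would be @nb.njit-compiled
    return np.arange(n, dtype=np.int64)

# Outside the jitted code, reinterpret the offsets as days since the epoch
dates = make_day_offsets(10).astype("datetime64[D]")
```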


#StackBounty: #python #python-3.x #parsing #unit-testing #pandas Pytest fixture for testing a vertex-parsing function

Bounty: 50

I have just started using pytest and I am still getting used to how they do things. It seems like fixtures are at the core of the library, and that they can be used for making small pieces of dummy data that will get reused. I see that there are other methods for handling large dummy data. I have the following test code which tests a module I wrote called generate_kml.

import pytest
import generate_kml as gk
import pandas

@pytest.fixture
def line_record():
    return pandas.Series({gk.DB_VERTICES: "LINESTRING(1.1 1.1,2.2 2.2)"})

def test_convert_wkt_to_coords(line_record):
    expected = pandas.Series({gk.DB_VERTICES: [("1.1", "1.1"), ("2.2", "2.2")]})
    assert gk.convert_wkt_vertices_to_coords(line_record).equals(expected)

I am wondering if this is the way fixtures are meant to be used; to set up small reused data. (I plan to use the line_record multiple times in the test file). Additionally, I am wondering about the readability or redundancy of assigning the expected value to expected. If I directly compared the two Series, the line would exceed PEP8’s recommended line length, so I would break it into two lines anyway. If it adds readability here, then would it be good practice to always assign the expected value to a variable called expected (assuming you are comparing values that are expected to be equal)?
Here is the function being tested from generate_kml:

def convert_wkt_vertices_to_coords(vertices_as_wkt):
    def parse_coords(wkt):
        wkt = wkt[wkt.find("(") + 1:wkt.find(")")]
        coords = wkt.split(",")
        coords = [tuple(x.split(" ")) for x in coords]
        return coords
    return vertices_as_wkt.apply(parse_coords)

One more thing here; in convert_wkt_vertices_to_coords I have a nested function. I don’t plan on reusing it, but in the past I haven’t had a need for nested functions, so it feels a bit off to me. Should I leave it as a nested function or break it out as its own function in the module?


#StackBounty: #python #linux #memory-leaks Python memory not being released on linux?

Bounty: 50

I am trying to load a large json object into memory and then perform some operations with the data. However, I am noticing a large increase in RAM after the json file is read, even after the object has gone out of scope.

Here is the code

import json
import objgraph
import gc
from memory_profiler import profile
@profile
def open_stuff():
    with open("bigjson.json", 'r') as jsonfile:
        d = jsonfile.read()
        jsonobj = json.loads(d)
        objgraph.show_most_common_types()
        del jsonobj
        del d
    print('d')
    gc.collect()


I tried running this script on Windows with Python 2.7.12 and on Debian 9 with Python 2.7.13, and I am seeing an issue with Python on Linux.

In Windows, when I run the script, it uses up a lot of RAM while the json object is being read and in scope (as expected), but it is released after the operation is done (as expected).

list                       3039184
dict                       413840
function                   2200
wrapper_descriptor         1199
builtin_function_or_method 819
method_descriptor          651
tuple                      617
weakref                    554
getset_descriptor          362
member_descriptor          250
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
     5     16.9 MiB     16.9 MiB   @profile
     6                             def open_stuff():
     7     16.9 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    197.9 MiB    181.0 MiB           d= jsonfile.read()
     9   1393.4 MiB   1195.5 MiB           jsonobj = json.loads(d)
    10   1397.0 MiB      3.6 MiB           objgraph.show_most_common_types()
    11    402.8 MiB   -994.2 MiB           del jsonobj
    12    221.8 MiB   -181.0 MiB           del d
    13    221.8 MiB      0.0 MiB       print ('d')
    14     23.3 MiB   -198.5 MiB       gc.collect()

However, in the Linux environment, over 500MB of RAM is still used even though all references to the JSON object have been deleted.

list                       3039186
dict                       413836
function                   2336
wrapper_descriptor         1193
builtin_function_or_method 765
method_descriptor          651
tuple                      514
weakref                    480
property                   273
member_descriptor          250
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
     5     14.2 MiB     14.2 MiB   @profile
     6                             def open_stuff():
     7     14.2 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    195.1 MiB    181.0 MiB           d= jsonfile.read()
     9   1466.4 MiB   1271.3 MiB           jsonobj = json.loads(d)
    10   1466.8 MiB      0.4 MiB           objgraph.show_most_common_types()
    11    694.8 MiB   -772.1 MiB           del jsonobj
    12    513.8 MiB   -181.0 MiB           del d
    13    513.8 MiB      0.0 MiB       print ('d')
    14    513.0 MiB     -0.8 MiB       gc.collect()

The same script run in Debian 9 with Python 3.5.3 uses less RAM but leaks a proportionate amount of RAM.

list                       3039266
dict                       414638
function                   3374
tuple                      1254
wrapper_descriptor         1076
weakref                    944
builtin_function_or_method 780
method_descriptor          780
getset_descriptor          477
type                       431
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
     5     17.2 MiB     17.2 MiB   @profile
     6                             def open_stuff():
     7     17.2 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    198.3 MiB    181.1 MiB           d= jsonfile.read()
     9   1057.7 MiB    859.4 MiB           jsonobj = json.loads(d)
    10   1058.1 MiB      0.4 MiB           objgraph.show_most_common_types()
    11    537.5 MiB   -520.6 MiB           del jsonobj
    12    356.5 MiB   -181.0 MiB           del d
    13    356.5 MiB      0.0 MiB       print ('d')
    14    355.8 MiB     -0.8 MiB       gc.collect()

What is causing this issue?
Both versions of Python are running 64bit versions.

EDIT: Calling that function several times in a row leads to even stranger data: the json.loads call uses less RAM each time it’s called, and after the third call the RAM usage stabilizes, but the earlier leaked RAM is not released.

list                       3039189
dict                       413840
function                   2339
wrapper_descriptor         1193
builtin_function_or_method 765
method_descriptor          651
tuple                      517
weakref                    480
property                   273
member_descriptor          250
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
     5     14.5 MiB     14.5 MiB   @profile
     6                             def open_stuff():
     7     14.5 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    195.4 MiB    180.9 MiB           d= jsonfile.read()
     9   1466.5 MiB   1271.1 MiB           jsonobj = json.loads(d)
    10   1466.9 MiB      0.4 MiB           objgraph.show_most_common_types()
    11    694.8 MiB   -772.1 MiB           del jsonobj
    12    513.9 MiB   -181.0 MiB           del d
    13    513.9 MiB      0.0 MiB       print ('d')
    14    513.1 MiB     -0.8 MiB       gc.collect()

list                       3039189
dict                       413842
function                   2339
wrapper_descriptor         1202
builtin_function_or_method 765
method_descriptor          651
tuple                      517
weakref                    482
property                   273
member_descriptor          253
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
     5    513.1 MiB    513.1 MiB   @profile
     6                             def open_stuff():
     7    513.1 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    513.1 MiB      0.0 MiB           d= jsonfile.read()
     9   1466.8 MiB    953.7 MiB           jsonobj = json.loads(d)
    10   1493.3 MiB     26.6 MiB           objgraph.show_most_common_types()
    11    723.9 MiB   -769.4 MiB           del jsonobj
    12    723.9 MiB      0.0 MiB           del d
    13    723.9 MiB      0.0 MiB       print ('d')
    14    722.4 MiB     -1.5 MiB       gc.collect()

list                       3039189
dict                       413842
function                   2339
wrapper_descriptor         1202
builtin_function_or_method 765
method_descriptor          651
tuple                      517
weakref                    482
property                   273
member_descriptor          253
Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
     5    722.4 MiB    722.4 MiB   @profile
     6                             def open_stuff():
     7    722.4 MiB      0.0 MiB       with open("bigjson.json", 'r') as jsonfile:
     8    722.4 MiB      0.0 MiB           d= jsonfile.read()
     9   1493.1 MiB    770.8 MiB           jsonobj = json.loads(d)
    10   1493.4 MiB      0.3 MiB           objgraph.show_most_common_types()
    11    724.4 MiB   -769.0 MiB           del jsonobj
    12    724.4 MiB      0.0 MiB           del d
    13    724.4 MiB      0.0 MiB       print ('d')
    14    722.9 MiB     -1.5 MiB       gc.collect()

Filename: testjson.py

Line #    Mem usage    Increment   Line Contents
    17     14.2 MiB     14.2 MiB   @profile
    18                             def wow():
    19    513.1 MiB    498.9 MiB       open_stuff()
    20    722.4 MiB    209.3 MiB       open_stuff()
    21    722.9 MiB      0.6 MiB       open_stuff()

EDIT 2: Someone suggested this is a duplicate of Why does my program's memory not release? , but the amount of memory in question is far from the “small pages” discussed in the other question.


#StackBounty: #neural-networks #python #model #confusion-matrix How to interpret that my model gives no negative class prediction on te…

Bounty: 50

I have built this model with Keras :

model = Sequential()
model.add(LSTM(50, return_sequences=True,input_shape=(look_back, trainX.shape[2])))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
model.fit(trainX, trainY,validation_split=0.3, epochs=50, batch_size=1000, verbose=1)

and the results are surprising… When I do this :

trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
print(confusion_matrix(trainY, trainPredict.round()))
print(confusion_matrix(testY, testPredict.round()))

I respectively get :

[[129261      0]
[   172 129138]]


[[10822     0]
[10871     0]]

In other words, my training confusion matrix is quite fine, while my testing confusion matrix puts everybody in a single class. What is surprising is that I have quite perfectly balanced instances in both the training and testing sets…

Why do I have this ?


My “preprocessing code”, based on Jason Brownlee’s tutorial:

def create_dataset(feat, targ, look_back=1):
    dataX, dataY = [], []
    print(len(targ) - look_back - 1)
    for i in range(len(targ) - look_back - 1):
        a = feat[i:(i + look_back), :]
        dataX.append(a)
        dataY.append(targ.iloc[i + look_back])
    return np.array(dataX), np.array(dataY)

and then

look_back = 50
trainX, trainY = create_dataset(X_train_resampled,Y_train_resampled, look_back)
print ("loopback1 done")
testX, testY = create_dataset(X_test_resampled,Y_test_resampled, look_back)

I have dimension (#recordings, 50 (look_back), #features (22)) for trainX and testX

I am not sure if this way of working is adequate. Maybe it’s the cause of the error.

