#StackBounty: #machine-learning #python #data-mining #text-mining #topic-model Compare two topic modelling sets

Bounty: 50

I have two sets of newspaper articles. I train a topic model on the first newspaper dataset to obtain the topic distribution of each article.

E.g., first newspaper dataset
article_1 = {'politics': 0.1, 'nature': 0.8, ..., 'sports':0, 'wild-life':1}

Separately, I train a topic model on my second newspaper dataset (from a different distributor) to obtain the topic distribution of each of its articles.

E.g., second newspaper dataset (from a different distributor)
article_2 = {'people': 0.3, 'animals': 0.7, ...., 'business':0.7, 'sports':0.2}

As the examples show, the topics obtained from the two datasets differ, so I manually matched similar topics based on their most frequent words.

I want to identify whether the two newspaper distributors publish the same news each week.

Hence, I am interested in knowing whether there is a systematic way of comparing the topics across the two corpora and measuring their similarity. Please help me.
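For concreteness, a minimal sketch of one possible comparison, assuming the topics have already been manually aligned as described above, and using SciPy's Jensen–Shannon distance between topic mixtures (the vectors below are hypothetical stand-ins):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical aligned topic order after manual matching, e.g. 'nature'
# in corpus 1 matched to 'animals' in corpus 2. Each vector is one
# article's topic mixture.
article_1 = np.array([0.10, 0.80, 0.00, 0.10])  # distributor 1
article_2 = np.array([0.20, 0.70, 0.05, 0.05])  # distributor 2

# Jensen-Shannon distance: 0 = identical mixtures, 1 = maximally different.
js = jensenshannon(article_1, article_2, base=2)

# Weekly comparison: average the topic mixtures of each distributor's
# articles for the week, then compare the two averages the same way.
week_1 = np.vstack([article_1])  # that week's distributor-1 articles
week_2 = np.vstack([article_2])  # that week's distributor-2 articles
weekly_js = jensenshannon(week_1.mean(axis=0), week_2.mean(axis=0), base=2)
```

With `base=2` the distance lies in [0, 1], so a low weekly value would suggest the two distributors covered similar topics that week.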


Get this bounty!!!

#StackBounty: #machine-learning #neural-networks #conv-neural-network #hyperparameter Training Neural Net with examples it misclassified

Bounty: 50

So I have a net which is working pretty well (93%+ on the validation set, which is the state of the art [https://yoniker.github.io/]) on some problem.

I want to squeeze even more performance out of it, so I intentionally collected examples it misclassified. (I reasoned that those examples would bring it closer to the true hypothesis, since the gradient is proportional to the loss, which is higher for mispredicted examples, and the “price” in time of obtaining such examples is almost the same as obtaining any example, mispredicted or not.)

  • What hyperparameters (the learning rate in particular) should I use for the new examples? (The gradient is bigger, so the values I previously found no longer work.)
  • Should I search again for new hyperparameters for the ‘new’ problem (continuing to train an already-trained net)?
  • Should I use the previous examples as well?
  • If so, what should be the ratio between the ‘old’ examples and the ‘new’ ones?
  • Are there known and proven methods for this particular situation?
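For reference, the kind of scheme I have been considering (all numbers here are hypothetical hyperparameters, not values I have validated): keep a fraction of ‘old’ examples in every mini-batch and restart with a smaller learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical index pools: previously seen examples vs. the newly
# collected misclassified ("hard") examples.
old_pool = np.arange(0, 10_000)
hard_pool = np.arange(10_000, 11_000)

def sample_batch(batch_size=64, hard_fraction=0.25):
    """Mini-batch mixing old and hard examples at a fixed ratio.

    Keeping some old examples in every batch guards against
    catastrophic forgetting; hard_fraction itself is a hyperparameter.
    """
    n_hard = int(batch_size * hard_fraction)
    hard = rng.choice(hard_pool, size=n_hard, replace=False)
    old = rng.choice(old_pool, size=batch_size - n_hard, replace=False)
    return np.concatenate([hard, old])

batch = sample_batch()
# A common fine-tuning heuristic: restart with a learning rate roughly
# 10x smaller than the original one, since the gradients are larger.
fine_tune_lr = 1e-3 / 10
```

The ratio and the learning-rate divisor are exactly the hyperparameters I am asking about.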


Get this bounty!!!

#StackBounty: #data-request #machine-learning #historical #sports Historic (start of 2017 season) Formula 1 betting odds

Bounty: 50

I’m looking for historic (i.e. from the start of the 2017 season onwards) betting odds for Formula 1 races. Specifically, I’m looking for data such as winning odds for each driver (and/or team) for each race, podium odds, and generally as much data as possible.

I’ve built a machine-learning F1 race prediction engine and I want to check if my predictions could somehow guide any betting endeavours.


Get this bounty!!!

#StackBounty: #regression #machine-learning Artificial neural networks EQUIVALENT to linear regression with polynomial features?

Bounty: 100

I want to improve my understanding of neural networks and their benefits compared to other machine learning algorithms. My understanding is set out below, and my question is:

Can you correct and supplement my understanding please? 🙂

My understanding:

(1) Artificial neural networks = a function which predicts output values from input values. According to the Universal Approximation Theorem (https://en.wikipedia.org/wiki/Universal_approximation_theorem), you can approximate any (sufficiently well-behaved) prediction function, given enough neurons.

(2) The same is true for linear regression, by taking polynomials of the input values as additional input values, since you can approximate (compare: Taylor expansion) any well-behaved function with polynomials.

(3) This means that (in a sense, with respect to the best possible outcomes) those two methods are equivalent.

(4) Hence, their main difference lies in which method lends itself to better computational implementation. In other words: with which method can you, based on training examples, find good values for the parameters that define the prediction function faster?
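To make point (2) concrete, a small sketch (degree and sample size chosen arbitrarily for illustration): linear regression on polynomial features closely approximating a smooth nonlinear target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Target: sin(2*pi*x) on [0, 1]; a degree-9 polynomial fit by ordinary
# least squares already approximates it closely.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * x).ravel()

model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(x, y)

# Maximum absolute error of the fitted polynomial on a dense grid.
grid = np.linspace(0, 1, 100).reshape(-1, 1)
max_err = np.max(np.abs(model.predict(grid) - np.sin(2 * np.pi * grid).ravel()))
```

A small neural network trained on the same data could reach a similar error, which is the sense in which I mean the two approaches are "equivalent" in expressive power.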

I welcome any thoughts, comments and recommendations to other links or books to improve my thinking.


Get this bounty!!!

#StackBounty: #machine-learning Group multiple transport orders to a tour and assign it to available vehicles based on previous behavior

Bounty: 100

It’s difficult to say what I want to achieve in just a title. I hope it’s not too misleading. If someone knows a better title, feel free to edit it.

What I want to do is group multiple transport orders into tours for tour planning in logistics, and assign each of those groups to a vehicle.

So I think I need to classify the vehicles based on, for example, their current location, the time the driver is allowed to drive, the space on the vehicle, and so on.

Next, I need to group my transport orders into k tours, where k is the number of available vehicles, so that every vehicle has only as many transport orders as it can handle. The grouping might consider things like who the customer is (priority), where the freight is shipped from, where it should be shipped to, and so on.

Update: I don’t want to find the (nearly) optimal solution by solving the TSP, but rather the solution the human planner would have chosen, as they did before. Therefore, I have to learn from previous data instead of optimizing.

Can anyone point me in the direction of one or more algorithms to look at?

I have a lot of data available to learn from:
About 2 million planned tours with their corresponding transport orders and the assigned vehicles.
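One framing I have considered (sketched below with stand-in random data, since I cannot share the real features) is supervised imitation: predict the vehicle the human planner chose from features of the order and the fleet state.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: in reality the features would be things like order
# origin/destination, freight size, customer priority, vehicle location
# and remaining driving time; the label is the vehicle the planner chose.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)  # stand-in for the historical choice

# Fit a classifier on the historical (features, chosen vehicle) pairs.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
train_acc = clf.score(X, y)

# For a new order, the predicted class imitates the planner's choice.
predicted_vehicle = clf.predict(X[:1])
```

Whether this framing scales to grouping whole tours (rather than single order-to-vehicle assignments) is part of what I am asking.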


Get this bounty!!!

#StackBounty: #machine-learning #svm #optimization #nonlinear-regression #gradient-descent What are some machine learning problems that…

Bounty: 100

I am working on continuous vector optimization, of which continuous multiobjective optimization is a particular case. I am interested in finding applications of these problems in machine learning. Is there any context in which you could use Pareto/weak Pareto optimal points to learn the optimal parameters? I was wondering whether there exists something like a multiobjective version of logistic regression, least squares, or SVMs. So, the idea is to select a Pareto point of a function $f$ which is decomposable as
$$f(\theta)= \sum_{i=1}^m f_i(\theta),$$ where $f_i:\Bbb R^n \to \Bbb R^k.$
Any pointer to the literature would also be very helpful.
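For what it is worth, the simplest connection I know of is weighted-sum scalarization: for strictly positive weights, the minimizer of $\sum_i w_i f_i(\theta)$ is Pareto optimal. A sketch for a bi-objective least-squares case (random stand-in data):

```python
import numpy as np

# Bi-objective least squares: f1(theta) = ||A1 theta - b1||^2 and
# f2(theta) = ||A2 theta - b2||^2. For w1, w2 > 0, the minimizer of
# w1*f1 + w2*f2 is a Pareto-optimal point; varying (w1, w2) traces
# out part of the Pareto front.
rng = np.random.default_rng(0)
A1, b1 = rng.normal(size=(30, 5)), rng.normal(size=30)
A2, b2 = rng.normal(size=(30, 5)), rng.normal(size=30)

def pareto_point(w1, w2):
    # Solve the normal equations of the scalarized problem.
    M = w1 * A1.T @ A1 + w2 * A2.T @ A2
    v = w1 * A1.T @ b1 + w2 * A2.T @ b2
    return np.linalg.solve(M, v)

theta = pareto_point(0.5, 0.5)
```

What I am after is whether anything beyond this scalarization trick (i.e. genuinely vector-valued optimality) is used in practice.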


Get this bounty!!!

#StackBounty: #machine-learning #deep-learning #gaussian-process #latent-variable #variational-bayes Expectation of Covariance Matrix f…

Bounty: 100

I am currently reading the paper entitled “Variational Auto-Encoded Deep Gaussian Processes” by Dai et al, a copy of which may be found here.

The paper proposes stacking Gaussian Process Latent Variable Models in an ANN-like fashion and introducing a variational distribution parameterised by a Multilayer Perceptron to aid tractability.

Before asking my question, I shall introduce some preliminaries for your convenience (paraphrased from the paper). If this is too long, I am happy to shorten it and rely more on the paper as a reference.

The marginal likelihood of the DGP model is given in Equation 4 in the paper as

$\displaystyle p(\mathbf{Y}) = \int p(\mathbf{Y} \mid \mathbf{X}_{1}) \prod_{l=2}^{L} p(\mathbf{X}_{l-1} \mid \mathbf{X}_{l})\,p(\mathbf{X}_{L})\,d\mathbf{X}_{1} \ldots d\mathbf{X}_{L}$

where $\mathbf{Y} \in \mathbb{R}^{N \times D}$ is a matrix of observed data, $L$ is the number of layers of latent variables, and, for layer $l$, $\mathbf{X}_{l} \in \mathbb{R}^{N \times Q_{l}}$ is a latent-space representation of feature dimensionality $Q_{l} < Q_{l-1}$.

For layer 1, the input is $\mathbf{Y}$, so for $l=2$, $\mathbf{X}_{l-1} = \mathbf{Y}$.

A variational lower bound on the above log marginal likelihood is given as per Equation 5 in the paper:

$\displaystyle \mathcal{L} = \sum_{l=1}^{L} \mathop{\mathbb{E}}\big[\log p(\mathbf{X}_{l-1}\mid\mathbf{X}_{l})\big]_{q(\mathbf{X}_{l-1})q(\mathbf{X}_{l})} + \sum_{l=1}^{L-1} H(q(\mathbf{X}_{l})) - KL(q(\mathbf{X}_{L}) \mid\mid p(\mathbf{X}_{L}))$

where $H(\cdot)$ is the Shannon entropy and $KL(\cdot \mid\mid \cdot)$ is the Kullback–Leibler divergence. To make inference tractable, the variational distribution $q(\cdot)$ is introduced above, and is defined as follows:

$\displaystyle q(\mathbf{X}_{l}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}^{(n)}_{l} \mid \boldsymbol\mu^{(n)}_{l}, \boldsymbol\Sigma^{(n)}_{l})$

where $\boldsymbol\mu_{l}(\cdot)$ is the output of a Multilayer Perceptron taking $\mathbf{X}_{l}$ as input, and $\boldsymbol\Sigma_{l}$ is the posterior variance, assumed to be diagonal and the same over all datapoints.

However, at this point the variational lower bound is still intractable, due to the expectation in the first term of $\mathcal{L}$. As such, auxiliary variables $\mathbf{U}_{l} \in \mathbb{R}^{M \times Q_{l}}$ are introduced, as are noise-free observations $\mathbf{F}_{l} \in \mathbb{R}^{N \times Q_{l-1}}$, and the first term of $\mathcal{L}$ is reformulated as follows (though at this point it is not clear to me how $\mathbf{F}_{l}$ differs from $\mathbf{X}_{l-1}$); see Equation 10 in the paper:

$\displaystyle \mathop{\mathbb{E}}\big[\log p(\mathbf{X}_{l-1}\mid\mathbf{X}_{l})\big]_{q(\mathbf{X}_{l-1})q(\mathbf{X}_{l})} \geq \mathop{\mathbb{E}}\big[ \log p(\mathbf{X}_{l-1} \mid \mathbf{F}_{l}) - KL(q(\mathbf{U}_{l} \mid \mathbf{X}_{l-1}) \mid\mid p(\mathbf{U}_{l})) \big]_{p(\mathbf{F}_{l} \mid \mathbf{U}_{l}, \mathbf{X}_{l})\,q(\mathbf{U}_{l} \mid \mathbf{X}_{l-1})\,q(\mathbf{X}_{l-1})\,q(\mathbf{X}_{l})}$

Finally, the authors give a distributed (in the parallel-computation sense) form of the variational lower bound $\mathcal{L}$, with the first term taking the following form:

$\displaystyle Tr\big(\mathop{\mathbb{E}}\big[ \mathbf{X}^{T}_{l-1}\mathbf{X}_{l-1} \big]_{q(\mathbf{X}_{l-1})}\big) = \sum_{n=1}^{N} \Big( (\boldsymbol\mu^{(n)}_{l-1})^{T}\boldsymbol\mu_{l-1}^{(n)} + Tr\big( \boldsymbol\Sigma_{l-1}^{(n)} \big) \Big)$

and the second term taking the following form

$\displaystyle Tr\big( \boldsymbol\Lambda_{l}^{-1}\boldsymbol\Psi_{l}^{T}\, \mathop{\mathbb{E}}\big[\mathbf{X}_{l-1}\mathbf{X}_{l-1}^{T}\big]_{q(\mathbf{X}_{l-1})}\, \boldsymbol\Psi_{l} \big) = Tr\big(\boldsymbol\Lambda_{l}^{-1}(\boldsymbol\Psi_{l}^{T}\mathbf{R}_{l-1}^{T})(\mathbf{R}_{l-1}\boldsymbol\Psi_{l})\big) + Tr\Big(\boldsymbol\Lambda_{l}^{-1} \Big(\sum_{n=1}^{N}\boldsymbol\Psi_{l}^{(n)}\alpha_{l-1}^{(n)}\Big)\Big(\sum_{n=1}^{N}\boldsymbol\Psi_{l}^{(n)}\alpha_{l-1}^{(n)}\Big)^{T}\Big)$

where $\boldsymbol\Lambda_{l} = \mathbf{K}_{\mathbf{U}_{l}\mathbf{U}_{l}} + \mathop{\mathbb{E}}\big[\mathbf{K}^{T}_{\mathbf{F}_{l}\mathbf{U}_{l}}\mathbf{K}_{\mathbf{F}_{l}\mathbf{U}_{l}}\big]_{q(\mathbf{X}_{l})}$ for covariance matrices $\mathbf{K}_{<\ldots>}$ generated by a covariance kernel such as the exponentiated quadratic. Similarly, $\boldsymbol\Psi_{l} = \mathop{\mathbb{E}}\big[\mathbf{K}_{\mathbf{F}_{l}\mathbf{U}_{l}}\big]_{q(\mathbf{X}_{l})}$.

Additionally, $\mathbf{R}_{l-1} = \big[(\boldsymbol\mu^{(1)}_{l-1})^{T} \dots (\boldsymbol\mu^{(N)}_{l-1})^{T} \big]$, $\alpha_{l-1}^{(n)} = \sqrt{Tr\big(\boldsymbol\Sigma_{l-1}^{(n)}\big)}$, and $\mathbf{A}_{l-1} = diag\big(\alpha_{l-1}^{(1)} \dots \alpha_{l-1}^{(N)}\big)$.

Finally, the second term of the above distributed form is obtained by making the following observation

$\mathop{\mathbb{E}}\big[ \mathbf{X}_{l-1}\mathbf{X}_{l-1}^{T} \big] = \mathbf{R}_{l-1}^{T}\mathbf{R}_{l-1} + \mathbf{A}_{l-1}\mathbf{A}_{l-1}$

Points of Confusion – Question(s)

Given the above formulation, there are a few computational issues that I am at present not entirely clear on.

Firstly, how in general does one take the expectation of a covariance matrix generated by a kernel such as the exponentiated quadratic?

For example, in the above formulation, the computation of the quantity $\boldsymbol\Psi_{l}$ is not clear to me. By the definitions given in the paper, to evaluate this quantity one takes the expectation w.r.t. the variational distribution $q(\mathbf{X}_{l})$, which itself is defined to be the product of Normal PDF evaluations of the GP latent variables given their MLP encodings – a scalar.

It is not clear to me how to use this to take the expectation of an arbitrary covariance/cross-covariance matrix $\mathbf{K} \in \mathbb{R}^{N \times M}$. As such, the evaluation of the quantity defined by the expectation in the second term of $\boldsymbol\Lambda_{l}$ is also unclear to me, as we are again attempting to take the expectation of a cross-covariance matrix w.r.t. the variational distribution $q(\mathbf{X}_{l})$.
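For concreteness, my current reading is that the expectation is taken entrywise, $\boldsymbol\Psi_{l}[n, m] = \mathbb{E}_{q(\mathbf{x}^{(n)}_{l})}\big[k(\mathbf{x}^{(n)}_{l}, \mathbf{z}^{(m)})\big]$, which for the exponentiated quadratic kernel has a closed form (as in Titsias and Lawrence's Bayesian GP-LVM) and can always be estimated by Monte Carlo. A sketch of the Monte Carlo version of what I think is meant (all sizes hypothetical):

```python
import numpy as np

# Entrywise Monte Carlo estimate of Psi = E_{q(X)}[K_FU], where
# q(X) = prod_n N(x_n | mu_n, diag(sigma2_n)) and K_FU[n, m] = k(x_n, z_m).
rng = np.random.default_rng(0)
N, M, Q = 4, 3, 2
mu = rng.normal(size=(N, Q))       # variational means (MLP outputs)
sigma2 = np.full((N, Q), 0.1)      # diagonal variational variances
Z = rng.normal(size=(M, Q))        # inducing inputs associated with U

def rbf(a, b, lengthscale=1.0):
    """Exponentiated quadratic kernel matrix between rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

S = 2000                           # number of Monte Carlo samples
Psi = np.zeros((N, M))
for _ in range(S):
    X = mu + np.sqrt(sigma2) * rng.normal(size=(N, Q))  # draw X ~ q(X)
    Psi += rbf(X, Z)
Psi /= S
```

If this entrywise reading is wrong, that misunderstanding is likely the root of my confusion.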

My central question is, given a formulation like the one set out above, how does one handle these expectation terms?

In addition, if anybody is familiar with this work and/or area of research, it would also be of great assistance to clarify the form of $\mathbf{F}_{l}$ and $\mathbf{U}_{l}$ for an arbitrary layer. To my understanding, for $l=1$, the data layer, $\mathbf{F}$ is a subset of the data points (removed from the dataset) and $\mathbf{U}$ the corresponding latent variables. However, what about arbitrary $l$?

Any assistance would be greatly appreciated.


Get this bounty!!!

#StackBounty: #machine-learning #deep-learning #normalization #loss-functions DNN: Mapping a fixed length string to another fixed lengt…

Bounty: 50

I have a situation where I’d like a DNN to learn the [unknown] mapping between two fixed-length strings. A [simplistic] example:

"-+--++-++-" -> "968"
"+-+-+-+-+-" -> "185"
"-+-+-+++--" -> "766"

I can normalize the characters in the input string to convert them into the numerical inputs required by the DNN, but I’m not sure how to structure the output layer (I’m using Keras).

Assuming the output string is 3 characters long:

model = models.Sequential()
model.add(layers.Dense(x1, activation='relu', input_shape=(N,)))
model.add(layers.Dense(x2, activation='relu'))
...
model.add(layers.Dense(3, activation='???'))
  1. I’ll need a way to convert the 3 outputs back into characters.
  2. I can’t seem to figure out which activation function I should use.
  3. I’ll also need to choose appropriate optimizer and loss functions.

model.compile(optimizer='???', loss='???')
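For reference, one output structure I am considering (an assumption on my part, not something I have settled on): treat each of the 3 output characters as an independent 10-way classification over the digits, i.e. a final `Dense(3 * 10)` reshaped to `(3, 10)` with a per-character softmax. The encoding/decoding side of that would look like:

```python
import numpy as np

DIGITS = "0123456789"

def encode(s):
    """'968' -> one-hot array of shape (len(s), 10), one row per character."""
    out = np.zeros((len(s), len(DIGITS)))
    for i, ch in enumerate(s):
        out[i, DIGITS.index(ch)] = 1.0
    return out

def decode(probs):
    """Per-character probability rows -> string, via argmax per row."""
    return "".join(DIGITS[i] for i in probs.argmax(axis=1))

onehot = encode("968")
roundtrip = decode(onehot)  # softmax outputs would be decoded the same way
```

Whether this per-character classification view is the right way to frame the problem is part of my question.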


Get this bounty!!!

#StackBounty: #machine-learning #survey #weighted-sampling #stratification Machine learning with weighted / complex survey data

Bounty: 150

I have worked a lot with various nationally representative data. These data sources have a complex survey design, so the analysis requires the specification of stratification and weight variables. Among the data sources that are within my area of study, machine learning tools have not been applied to them. One obvious reason is that machine learning methods (currently) do not take into account weight and stratification variables.

The goal of the weighted / stratified analyses is to obtain adjusted population estimates, which is different from the goal / purpose of machine learning. What thoughts do people have about using the nationally representative data sources while ignoring the weight and stratification variables? In other words, what would be your thoughts if you were reviewing a machine learning study that used nationally representative data but ignored the weight and stratification variables, assuming that the researcher / author was up-front about this methodological decision and was not making claims of nationally representative results?
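One middle ground I am aware of: many scikit-learn estimators accept per-row weights via `sample_weight`, which lets the fit respect the survey weights even though it does not reproduce design-based (stratified / clustered) variance estimates. A sketch with stand-in data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data; in practice X and y would come from the survey
# microdata and w would be the survey weight variable.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
w = rng.uniform(0.5, 3.0, size=500)  # stand-in for survey weights

# sample_weight reweights each observation's contribution to the loss.
clf = LogisticRegression().fit(X, y, sample_weight=w)
acc = clf.score(X, y)
```

I would be curious whether reviewers consider this kind of loss reweighting an acceptable substitute for a full design-based analysis.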

Thanks in advance!


Get this bounty!!!

#StackBounty: #machine-learning #prediction #semi-supervised Difference between semi-supervised learning and prediction?

Bounty: 50

What is the difference between semi-supervised learning and prediction? It seems to me they’re the same (both predict labels).


Get this bounty!!!