#StackBounty: #machine-learning #distributions #dataset #survey #selection-bias Correcting Sample Selection Bias given actual Distribut…

Bounty: 100

I have two datasets, both from the same population:
The samples from the first survey are quite representative of the underlying truth. However, the second survey comes with a change in distribution due to sample selection bias.

If I merge the data and assign a class (‘surveyA’, ‘surveyB’) to each instance, it should be possible to predict which survey an instance comes from (because of the biased distribution in ‘surveyB’). Is it good practice to simply build such a model and remove the instances that make this classification possible?

What are ways to “correct/remove the bias in” the second dataset? How can I achieve 0.5 accuracy in classification (assuming both datasets are equally large)?

Both datasets represent surveys on political participation. SurveyB contains the data of probably more politically interested people, since they’ve participated in the first place. SurveyA can be assumed to be representative of “all people”, don’t ask me why.
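One common alternative to dropping instances is to reweight surveyB so that it matches surveyA on the variables that drive selection. Below is a minimal sketch of that idea (post-stratification weights), assuming a single hypothetical categorical variable for political interest and made-up records; neither comes from the surveys described above.

```python
from collections import Counter

def poststrat_weights(reference, biased):
    """Weight per category = share in the reference sample / share in the biased sample."""
    ref_counts = Counter(reference)
    bias_counts = Counter(biased)
    n_ref, n_bias = len(reference), len(biased)
    return {
        cat: (ref_counts[cat] / n_ref) / (bias_counts[cat] / n_bias)
        for cat in bias_counts
    }

# Hypothetical data: political interest level per respondent.
survey_a = ["low"] * 50 + ["high"] * 50   # assumed representative
survey_b = ["low"] * 20 + ["high"] * 80   # over-represents "high"

weights = poststrat_weights(survey_a, survey_b)

# Weighted share of "high" in surveyB after the correction.
total = sum(weights[x] for x in survey_b)
high_share = sum(weights[x] for x in survey_b if x == "high") / total
```

With these weights the weighted composition of surveyB matches surveyA on that one variable. With many variables, the same ratio could in principle be estimated from the surveyA-vs-surveyB classifier's predicted probabilities, which is an assumption about the approach rather than something stated in the question.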

Get this bounty!!!

#StackBounty: #machine-learning #self-study Build a regression model with multiple small time series

Bounty: 50

For a personal project, I’ve built a dataset of hockey players’ statistics over time. I am looking for insights as to how I should build my predictive model. The model would be used to predict how many points a player would be expected to produce given his previous performance.

To keep it simple, let’s keep only the three most “important” columns of my dataset (there are a few more features than that, but I don’t think they are necessary for the problem):

PlayerId | Points | Year

Now, I have tried to use the machine learning algorithms I know, but:

  • The data behaves kind of like a time series. Let’s say I have 10k players, well those players have stats over the years (sometimes from season 2005 to 2017, others from 2009-2010, well you get the point). Considering this relation between rows (For example playerId 1, Year 2005 and playerId 1, Year 2006), I can’t use most of the algorithms I know because this logic would be thrown out the window and I think it’s an important one.

  • Considering the data is related over time for some rows, I don’t think I can really model it as a single time series. There are small time series within the dataset, one per player, but I certainly don’t have enough data to treat each one as such (with one row per player per year, I’d have maybe 15 rows at most for a player, which isn’t enough to build a good prediction).

Considering these two points, I’m pretty much stuck without a solution.

I’ve thought about merging all the rows for a player into one, so I’d have:

PlayerId | Points2005 | Points2006 | etc., but it doesn’t make much sense since we lose the notion of time.

I also considered that I could make a predictive model for all the players individually then use the weights I’d find to make another predictive model, but I’m very unsure as to how this would turn out.

I’m just looking for a small tip to push me in the right direction, whether it’s pure statistics related or a machine learning algorithm.
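One way to keep the per-player time ordering while still pooling all players into one training set is to turn each season into a row whose features are that player's previous seasons. The sketch below is only an illustration of that reshaping, with made-up records and a hypothetical helper name; it assumes each player's seasons are consecutive.

```python
from collections import defaultdict

def make_lagged_rows(records, n_lags=2):
    """Turn (player_id, year, points) rows into (player_id, lag features, target) rows.

    Each player's seasons are sorted by year; every season that has
    n_lags earlier seasons becomes one training row.
    """
    by_player = defaultdict(list)
    for player_id, year, points in records:
        by_player[player_id].append((year, points))

    rows = []
    for player_id, seasons in by_player.items():
        seasons.sort()                      # chronological order per player
        pts = [p for _, p in seasons]
        for i in range(n_lags, len(pts)):
            rows.append((player_id, pts[i - n_lags:i], pts[i]))
    return rows

# Made-up records in the PlayerId | Year | Points spirit of the question.
records = [(1, 2005, 40), (1, 2006, 55), (1, 2007, 60), (2, 2009, 10), (2, 2010, 12)]
rows = make_lagged_rows(records, n_lags=2)
```

Any ordinary regression algorithm can then be trained on the pooled rows, since the time dependence is carried by the lag features instead of by the row order.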

#StackBounty: #machine-learning #distributions #bayesian What is the correct formula for multinomial Naive Bayes

Bounty: 50

So far, I have seen two ways of writing multinomial NB, and I was wondering which would be the correct one to use in theory.


Suppose we are going to classify the sentence

We are going really really fast


In terms of the likelihood, the two methods are described as follows:

  1. $P(we, are, going, really, really, fast|C_k) = P(we|C_k) P(are|C_k) P(going|C_k) P(really|C_k) P(really|C_k) P(fast|C_k)$

  2. $P(we, are, going, really, really, fast|C_k) = P(we=1, are=1, going=1, really=2, fast=1|C_k) = \frac{6!}{2!} P(we|C_k) P(are|C_k) P(going|C_k) P(really|C_k)^2 P(fast|C_k)$


The difference is whether the likelihood includes the coefficient of the multinomial distribution. The coefficient accounts for the number of possible word orderings.

In method one, the order matters, since we are not considering permutations of words, we are interested in only one particular word combination (the natural order).

However, for the second method, the order doesn’t matter. We are counting word occurrences, so any permutation that satisfies the counts is taken into consideration.

I am confused because they seem to be the same method, yet the missing coefficient makes them look like two distinct methods. How should I understand this difference?
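A small numerical check may help here. The multinomial coefficient depends only on the word counts, not on the class $C_k$, so it multiplies every class likelihood by the same constant. The sketch below uses made-up word probabilities for two hypothetical classes and assumes equal priors, so the class is chosen by likelihood alone.

```python
from collections import Counter
from math import factorial, prod

def multinomial_coefficient(counts):
    """n! / (x_1! * x_2! * ...) for a mapping of word -> count."""
    n = sum(counts.values())
    return factorial(n) // prod(factorial(c) for c in counts.values())

def likelihood(counts, word_probs, with_coefficient):
    """Product of P(word|class)^count, optionally times the multinomial coefficient."""
    value = prod(word_probs[w] ** c for w, c in counts.items())
    return value * multinomial_coefficient(counts) if with_coefficient else value

counts = Counter("we are going really really fast".split())

# Hypothetical per-class word probabilities (made up for illustration).
class_probs = {
    "C1": {"we": 0.2, "are": 0.2, "going": 0.2, "really": 0.3, "fast": 0.1},
    "C2": {"we": 0.3, "are": 0.3, "going": 0.1, "really": 0.1, "fast": 0.2},
}

best_without = max(class_probs, key=lambda c: likelihood(counts, class_probs[c], False))
best_with = max(class_probs, key=lambda c: likelihood(counts, class_probs[c], True))
```

Both versions pick the same class, because the coefficient (here 6!/2! = 360) cancels when class likelihoods are compared or normalized into a posterior.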

#StackBounty: #machine-learning #random-forest #uncertainty #partial #partial-plot Quantifying uncertainty when fitting a statistical m…

Bounty: 100

Question: I estimate the partial dependence of $y$ on one predictor in a fitted random forest (RF). I want to now fit a parametric model to this partial dependence. How can I estimate my uncertainty when fitting this statistical model to the partial dependence estimated from the RF?

To flesh this out with an example: suppose that plant height is influenced by light, rainfall and pH, all in a nonlinear manner. I fit an RF (or other machine learning model) with height being predicted by all three. If I want to understand how light alone affects height, I can estimate its partial effect (or equivalently, the partial dependence of height on light) from the fitted RF.

Suppose that I know what this shape should look like and have an equation to describe it. I would like to fit this equation to the partial dependence estimated from the RF. Loosely speaking, I am trying to ‘filter’ the RF’s estimated partial dependence through the equation, which represents our prior understanding based on many earlier studies. I am using an RF instead of a fully parametric model because I do not know precisely how the other variables (rainfall, pH) affect height.

How can I go about estimating these parameter values in a way that captures the uncertainty in (i) the data and (ii) the fitted random forest?

I encountered a version of this idea in a post on Andrew Gelman’s stats blog. According to Gelman – who focusses on predictions from the whole model, not partial effects/dependencies – the idea has not really been developed.

I suspect that there is a bootstrap-based solution to this, but I am unsure. There may be simpler solutions that work more directly from the fitted random forest, but I am unaware of them because of an incomplete understanding of how partial effects are calculated. I’d appreciate any suggestions.
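For reference, the usual way a partial dependence curve is computed can be written in a few lines: for each grid value of the predictor of interest, overwrite that predictor in every row, predict, and average. The sketch below uses a hypothetical stand-in for the fitted model's predict function and made-up rows (light, rainfall, pH); it is not tied to any particular RF library.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over the data with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value   # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical stand-in for a fitted model's predict function.
def toy_predict(row):
    light, rainfall, ph = row
    return 2.0 * light + rainfall * ph

rows = [(1.0, 2.0, 3.0), (4.0, 1.0, 5.0)]
pd_light = partial_dependence(toy_predict, rows, feature_index=0, grid=[0.0, 1.0])
```

One possible bootstrap along the lines suspected above would resample the rows, refit the forest, recompute this curve, and refit the parametric equation on each replicate, then look at the spread of the fitted parameters. Whether that spread is a well-calibrated uncertainty estimate is an open assumption here.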

#StackBounty: #machine-learning #classification #binary-data #accuracy Difference between cumulative gains chart and cumulative accurac…

Bounty: 50

I am confused about the following:

Here I find the definition of cumulative gains chart as the plot of

  • x-axis: the rate of predicted positive
  • y-axis: true positive rate

It turns out that we can e.g. order by the probability of a positive outcome and plot the corresponding rate of true positives.

On the other hand there is the cumulative accuracy profile (CAP) explained here where it is said to be constructed as

The CAP of a model represents the cumulative number of positive
outcomes along the y-axis versus the corresponding cumulative number
of a classifying parameter along the x-axis.

Another source for the CAP is “Measuring the Discriminative Power of Rating Systems” by Engelmann, Hayden and Tasche.
Together with the plots I would say that the two concepts are the same. Is this true?
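To make the comparison concrete, here is a minimal sketch of the cumulative gains construction as defined above: instances are sorted by predicted score, and the fraction of instances targeted so far is plotted against the fraction of all positives found so far. The scores and labels are made up.

```python
def cumulative_gains(scores, labels):
    """Return (xs, ys): fraction of instances targeted vs fraction of positives found."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n, n_pos = len(labels), sum(labels)
    xs, ys, found = [0.0], [0.0], 0
    for rank, i in enumerate(order, start=1):
        found += labels[i]          # cumulative number of positives so far
        xs.append(rank / n)         # rate of predicted positives
        ys.append(found / n_pos)    # true positive rate
    return xs, ys

scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
xs, ys = cumulative_gains(scores, labels)
```

Under the quoted CAP definition, and assuming the "classifying parameter" means instances ordered by score, the CAP plots the raw cumulative counts (`rank` and `found`). Dividing both axes by their totals gives exactly the points computed here.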

#StackBounty: #machine-learning #multimodality Confusion about multimodal machine learning

Bounty: 50

I recently browsed through this tutorial on multimodal data.

Note: according to the slides, “multimodal” here means features of very different types that express the same thing (think of the picture and voice of someone talking), not a probability distribution with multiple modes.

What I do not understand about the whole approach to multimodal machine learning techniques is why they are applied only in the case of features that obviously express the same thing, and not also in the case of features for which it is not clear whether they express the same thing (example: you want to predict which webshop visitors are likely to buy stuff and you measure, e.g., their mouse movements as well as the time they spend looking at products; these features might be correlated too).
For example, the whole search for correlations between features that are assumed to express the same thing could also be applied in the case I outlined above. Of course, applying multimodal techniques in the latter case might not yield anything, but it seems no one is even trying.

#StackBounty: #python #machine-learning #scikit-learn #jupyter #decision-tree Plot Interactive Decision Tree in Jupyter Notebook

Bounty: 50

Is there a way to plot a decision tree in a Jupyter Notebook, such that I can interactively explore its nodes? I am thinking about something like this dt. This is an example from KNIME.

I have found https://planspace.org/20151129-see_sklearn_trees_with_d3/ and https://bl.ocks.org/ajschumacher/65eda1df2b0dd2cf616f and I know you can run d3 in Jupyter, but I have not found any packages that do that.
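For the d3 route, the first step is usually to turn the fitted tree into nested JSON. The sketch below is a hypothetical helper, not an existing package: it takes the parallel arrays that a fitted scikit-learn tree exposes on `clf.tree_` (`children_left`, `children_right`, `feature`, `threshold`, with `-1` marking a leaf's children) and builds a nested dict. The arrays here are written by hand so the example runs without scikit-learn.

```python
import json

def tree_to_dict(children_left, children_right, feature, threshold, node=0):
    """Convert parallel tree arrays into a nested dict (leaf marker: child == -1)."""
    if children_left[node] == -1:
        return {"name": f"leaf {node}"}
    return {
        "name": f"x[{feature[node]}] <= {threshold[node]}",
        "children": [
            tree_to_dict(children_left, children_right, feature, threshold,
                         children_left[node]),
            tree_to_dict(children_left, children_right, feature, threshold,
                         children_right[node]),
        ],
    }

# Hand-written arrays for a 3-node tree (one split, two leaves).
nested = tree_to_dict([1, -1, -1], [2, -1, -1], [0, -2, -2], [0.5, -2.0, -2.0])
as_json = json.dumps(nested)
```

The resulting JSON string could then be handed to a d3 collapsible-tree script in the notebook. The `name`/`children` layout is the one commonly used in such d3 examples, which is an assumption about the front end rather than a requirement.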

#StackBounty: #machine-learning #optimization #gradient-descent #sgd How to set up a linear system to interpolate the train data perfec…

Bounty: 50

Consider a (consistent) regression problem (i.e. we are trying to predict a real valued function and we don’t have inconsistencies in the way we map x’s to y).

I am trying to perfectly fit/interpolate the (train) data set with Gradient Descent (to better understand Gradient Descent academically) with a fixed step size:

$$ w^{(t+1)} = w^{(t)} - \eta \nabla_w L(w^{(t)}) $$

I’ve tried things empirically by minimizing the L2 loss:

$$ L(w) = \| Xw - y \|^2 $$

I noticed that sometimes it’s hard to find the right step size such that the loss value $L(w)$ is zero within machine precision (fit/interpolate data in this sense https://arxiv.org/abs/1712.06559). I suspect that it’s highly dependent on the basis/kernels I use since the gradient and Hessian are:

$$ \nabla L(w) = 2 X^T (Xw - y) $$

$$ \nabla^2 L(w) = 2 X^T X $$

I want to use only first-order methods to solve this problem, so I am wondering: how do I figure out a good step size and/or basis/feature matrix $X$, given that I want to solve this problem with a first-order method?

If I decide to use, say, Hermite polynomials, why would that be better than other polynomials, for example, if I want to fit/interpolate the data perfectly?

What if I used a Gaussian kernel or Laplacian kernel? How would the kernel matrix $X = K$ change and how would Gradient Descent be affected? How does the curvature change as I change the kernel matrix? How can I set up the problem so the optimization via (S)GD fits the data perfectly?
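For this quadratic loss the effect of a fixed step size can be worked out in closed form. The following is a sketch under the stated consistency assumption (some $w^*$ satisfies $Xw^* = y$), writing $r^{(t)}$ for the residual and $\lambda_{\max}$, $\lambda_{\min}$ for the largest and smallest non-zero eigenvalues of $X^T X$; this notation is introduced here and is not part of the question. Using $\nabla L(w) = 2 X^T (Xw - y)$, one gradient step gives

$$ r^{(t)} = Xw^{(t)} - y, \qquad r^{(t+1)} = (I - 2\eta X X^T)\, r^{(t)} $$

Because $r^{(t)}$ stays in the range of $X$, where the eigenvalues of $X X^T$ are the non-zero eigenvalues of $X^T X$, the residual goes to zero from every starting point exactly when

$$ 0 < \eta < \frac{1}{\lambda_{\max}} $$

and the fixed step size with the best worst-case contraction is

$$ \eta^* = \frac{1}{\lambda_{\max} + \lambda_{\min}}, \qquad \| r^{(t+1)} \| \le \frac{\kappa - 1}{\kappa + 1} \| r^{(t)} \|, \qquad \kappa = \frac{\lambda_{\max}}{\lambda_{\min}} $$

So the choice of basis or kernel enters through the condition number $\kappa$: the closer $\kappa$ is to 1, the fewer iterations are needed to reach machine precision, while a badly conditioned $X$ makes every admissible fixed step size slow.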

#StackBounty: #machine-learning #neural-networks #object-detection #labeling Do more object classes increase or decrease the accuracy o…

Bounty: 100

Assume you have an object detection dataset (e.g., MS COCO or Pascal VOC) with N images where k object classes have been labeled. You train a neural network (e.g., Faster-RCNN or YOLO) and measure the accuracy (e.g., IOU@0.5).

Now you introduce x additional object classes and add the corresponding labels to your original dataset, giving you a dataset with N images where k+x object classes have been labeled.

Will the accuracy of the trained network increase or decrease?

To be more specific, we have a traffic sign dataset with around 20 object classes. Now we are thinking about adding additional traffic sign classes (labeling the new classes, without adding new images or changing our network architecture) and we are wondering if this will increase or decrease performance.

On the one hand I think more object classes will make distinction between classes harder. Additionally, a neural network can only hold a limited amount of information, meaning if the number of classes becomes very large there might just not be enough weights to cope with all classes.

On the other hand, more object classes mean more labels, which may help the neural network. Additionally, transfer learning effects between classes might increase the accuracy of the network.

In my opinion there should be some kind of sweet-spot for each network architecture but I could not find any literature, research or experiments about this topic.

#StackBounty: #machine-learning #correlation #linear-model #canonical-correlation Canonical correlation analysis with a tiny example an…

Bounty: 150

I’ve tried reading many explanations of CCA, and I don’t understand it. For example, on Wikipedia, it refers to two “vectors” $a$ and $b$ such that $\rho = \text{corr}(a^{\top} X, b^{\top} Y)$ is maximal. But if $a$ and $X$ are vectors, isn’t $a^{\top} X$ a scalar? What does it mean for two scalars to be correlated?

Other explanations use matrix notation without giving any dimensions.

Can someone explain CCA with a small example and all the dimensions provided?
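As a starting point on the dimensions: in the Wikipedia statement $X$ and $Y$ are random vectors, so $a^{\top} X$ is a scalar random variable, and with data it becomes one scalar per sample. The sketch below uses made-up data with n = 4 samples, p = 2 columns in X and q = 3 columns in Y, and hand-picked weight vectors. It only evaluates the correlation for one choice of $a$ and $b$; CCA itself would search over all $a$ and $b$ for the largest value.

```python
from math import sqrt

def project(data, weights):
    """Map each sample (a row) to the scalar dot product of weights and row."""
    return [sum(w * x for w, x in zip(weights, row)) for row in data]

def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)

# Made-up data: X is n x p = 4 x 2, Y is n x q = 4 x 3.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0]]
Y = [[2.0, 5.0, 1.0], [4.0, 1.0, 0.0], [6.0, 2.0, 2.0], [8.0, 0.0, 1.0]]
a = [1.0, 0.0]        # length p, picked by hand
b = [1.0, 0.0, 0.0]   # length q, picked by hand

u = project(X, a)     # n scalars, one per sample
v = project(Y, b)     # n scalars, one per sample
rho = pearson(u, v)   # correlation across the n samples
```

So the two "scalars" are really two length-n columns of projected values, and the correlation is taken across samples.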
