## #StackBounty: #hypothesis-testing #statistical-significance #sampling #binomial #fishers-exact Test if two averages of two binomial dis…

### Bounty: 50

First, this may be a duplicate of:

statistical significance

I’m unsure if that post covers my exact situation. If so, just mark this as a duplicate. Let’s say I sell cars and have a list of 100k potential clients. I select 10k of them based on each client’s probability of buying a car, as predicted by a model with many features. Each client has a different probability. I then observe whether each of the 10k clients buys a car over different periods of time, and I bucket the clients into four buckets based on, let’s say, their income level.

For week 4, I have 1200 clients in the income bucket "$50k-$75k", and 1000 of those bought a car. I also have 1600 clients in the income bucket "$76k-$100k", and 1100 of them bought a car. Can I use Fisher’s exact test to calculate the p-value between these two subgroups of clients in these specific buckets?

Two things I am unsure of are the definition of a sample in this experiment and whether the sample sizes are too large for the Fisher test. Also, I’m assuming this is sampling without replacement; does that work with the Fisher test?

Are there better options than the Fisher test?
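One commonly used large-sample alternative to Fisher's exact test for comparing two proportions is the pooled two-proportion z-test. The following is a minimal sketch applied to the week 4 counts quoted above. The function name `two_proportion_z_test` is hypothetical, and the sketch assumes the purchases are independent Bernoulli outcomes within each bucket, which may not hold if clients were selected by a model score.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled success rate under the null hypothesis p_a == p_b.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Week 4 counts from the question: 1000 of 1200 vs. 1100 of 1600 buyers.
z, p_value = two_proportion_z_test(1000, 1200, 1100, 1600)
print(f"z = {z:.2f}, two-sided p = {p_value:.3g}")
```

With counts this large, the normal approximation and an exact test would typically lead to the same conclusion; the choice mostly affects computation rather than the result.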

Get this bounty!!!

## #StackBounty: #sampling #pdf #density-estimation #generative-models #normalizing-flow Inference in Normalizing Flow model: NICE(non lin…

### Bounty: 50

I was recently reading Y. Bengio’s paper on NICE (https://arxiv.org/abs/1410.8516). In the paper, the authors take the view that a good representation is one in which the data distribution is easy to model. The proposed method therefore transforms the original distribution $$p_X(x)$$ into $$p_H(h)$$ using a bijective mapping $$f(x)=h$$. This is modeled by a neural network containing a special type of layer called a coupling layer.

Overall, the paper is an interesting read and is very well written. Hence, I decided to implement the model. I coded the equations in section 5 and maximized $$\log(p_H(h)) + \sum_i s_i$$ using Adam. However, there are certain things that are unclear to me.

1. How do I draw samples from the model? The paper says that ancestral sampling can be used, but I’m not sure how to do that. (I sample $$h \sim p_H(h)$$, but how do I then get $$x$$?)
2. Also, how do I sample from $$p_{h_d}$$ in the first place, as given in section 3.4?
3. How do I use this model for inpainting, as described in section 5.2?
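Because $$f$$ is bijective, a sample can be obtained by drawing $$h$$ from the prior and computing $$x=f^{-1}(h)$$. The sketch below illustrates this for a single additive coupling layer followed by a diagonal scaling, with a standard logistic prior sampled by inverting its CDF. The function `m` is a hypothetical stand-in for the learned coupling network, and all names are assumptions for illustration. A real model stacks several coupling layers, which would be inverted in reverse order.

```python
import math
import random


def m(x1):
    # Hypothetical stand-in for the learned coupling function m(.).
    return [math.tanh(2.0 * v) + 0.5 for v in x1]


def coupling_forward(x):
    # Additive coupling: y1 = x1, y2 = x2 + m(x1).
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + s for b, s in zip(x2, m(x1))]


def coupling_inverse(y):
    # Inverse of the additive coupling: x1 = y1, x2 = y2 - m(y1).
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    return y1 + [b - s for b, s in zip(y2, m(y1))]


def sample_logistic(dim, rng):
    # Inverse-CDF sampling of a standard logistic: h = log(u) - log(1 - u).
    out = []
    for _ in range(dim):
        u = min(max(rng.random(), 1e-12), 1 - 1e-12)
        out.append(math.log(u) - math.log1p(-u))
    return out


def sample_x(dim, log_scale, rng):
    # dim is assumed even so the two halves have equal size.
    h = sample_logistic(dim, rng)
    # Undo the diagonal scaling h = exp(s) * y.
    y = [v * math.exp(-s) for v, s in zip(h, log_scale)]
    return coupling_inverse(y)


x_sample = sample_x(4, [0.0, 0.0, 0.0, 0.0], random.Random(0))
print(x_sample)
```

Each component of $$h$$ is drawn independently here, which is what a factorized prior permits.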


## #StackBounty: #sampling #multivariate-analysis #multivariate-normal how to construct and sample from conditional multi-variate normal d…

### Bounty: 50

Two random variables, D and W, represent the uncertainty around the daily and weekly prices. Both random variables follow a normal distribution. I want a probability distribution function (perhaps multivariate and conditional) to represent the uncertainty around these two variables (Q1). Secondly, I want to draw N samples, e.g. using the classic Monte Carlo method (Q2).

For instance, let’s say $$w_1, w_2, \ldots, w_n$$ represent the uncertainty around the weekly price, with a mean of $$\overline{w}$$ and a standard deviation of $$\sigma_w$$. For each weekly sample $$w_i$$, the daily price samples should have a mean equal to the corresponding weekly sample and a standard deviation that is a function of the corresponding weekly price.
To better explain, $$D_{wi}^{dj}, j=1:M$$ are the M daily price samples corresponding to the weekly sample $$w_i$$. The mean of $$D_{wi}^{dj}, j=1:M$$ is equal to $$w_i$$ and its standard deviation is $$w_i / 10$$.

How can I address Q1 and Q2 (finding the probability distribution function and sampling from it) for the given problem?
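One way to read the setup is as a two-level model: $$W\sim N(\overline{w},\sigma_w^2)$$ and $$D\mid W=w\sim N(w,(w/10)^2)$$, so the joint density factorizes as $$p(w,d)=p(w)\,p(d\mid w)$$. Since the conditional variance depends on $$w$$, the joint distribution is not multivariate normal, but it can still be sampled in two steps. The sketch below is a minimal illustration of that reading. The function name and the use of `abs(w)` to keep the standard deviation non-negative are assumptions.

```python
import random
import statistics


def sample_weekly_daily(w_mean, w_std, n_weeks, m_days, seed=0):
    """Draw n_weeks weekly prices and, for each, m_days daily prices."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_weeks):
        # Step 1: weekly price from its marginal normal distribution.
        w = rng.gauss(w_mean, w_std)
        # Step 2: daily prices conditional on w, with std = |w| / 10.
        daily_std = abs(w) / 10.0
        daily = [rng.gauss(w, daily_std) for _ in range(m_days)]
        samples.append((w, daily))
    return samples


samples = sample_weekly_daily(100.0, 5.0, n_weeks=5, m_days=2000)
for w, daily in samples:
    print(round(w, 2), round(statistics.mean(daily), 2),
          round(statistics.stdev(daily), 2))
```

Each printed row shows a weekly draw next to the empirical mean and standard deviation of its daily draws, which should sit close to $$w_i$$ and $$w_i/10$$.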


## #StackBounty: #r #hypothesis-testing #distributions #t-test #sampling How do I know whether my sample is fair?

### Bounty: 100

I have a time series dataset with 500 million rows, twenty-six columns and 400 thousand unique actors.

It’s too much data for me to process all at once, so I want to take a fair sample of my data.

I’ll spend some time talking about the factors:

1. Four of the columns are IDs and the 400K number references the most granular ID
2. Two of the columns are dates, one column for the day we recorded the row and another for the actor’s creation date
3. Eight of the columns are numeric variables that cluster around 0-5 and have a long tail up to hundreds
4. Eight of the columns are factors with five or fewer levels. Most of these factors cluster 80% of unique actors into one or two buckets and then split the remaining 20% across the remaining one to three buckets
5. Four of the columns are factors with lots of levels. The most levels for a factor is in the hundreds and the other three have about seventy levels. 80% of the rows can be attributed to the top fifty levels for the most populous factor and top twenty for the other three factors.

My plan is to take a simple random sample of 120 thousand unique actors and filter my dataset down to the rows belonging to those actors.

I’m concerned that I’ll get a sample that’s not representative of my population. I’m concerned, in part, because my data becomes extremely sparse with the less popular levels and the long tails for the numerics.
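A simple first check is to sample at the actor level and then compare the share of each factor level in the sample against the full population. The sketch below does this on toy data. The column names `actor` and `segment`, the helper names, and the toy proportions are all hypothetical, and comparing level shares is only one of several possible diagnostics.

```python
import random
from collections import Counter


def sample_actors(actor_ids, k, seed=0):
    # Simple random sample of k unique actor IDs.
    rng = random.Random(seed)
    return set(rng.sample(sorted(set(actor_ids)), k))


def level_shares(rows, key):
    # Fraction of rows falling in each level of a factor column.
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {level: c / total for level, c in counts.items()}


def max_share_gap(population_rows, sample_rows, key):
    # Largest absolute difference in level share, population vs. sample.
    pop = level_shares(population_rows, key)
    smp = level_shares(sample_rows, key)
    return max(abs(pop[level] - smp.get(level, 0.0)) for level in pop)


# Toy population: 1000 actors, 5 rows each, 80% of actors in segment "a".
rows = [{"actor": i % 1000, "segment": "a" if i % 1000 < 800 else "b"}
        for i in range(5000)]
sampled = sample_actors([r["actor"] for r in rows], k=300)
sample_rows = [r for r in rows if r["actor"] in sampled]
gap = max_share_gap(rows, sample_rows, "segment")
print(f"max share gap for 'segment': {gap:.3f}")
```

Rare levels that vanish from the sample show up as a gap equal to their population share, which is one way to spot the sparsity problem described above.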


## #StackBounty: #regression #sampling #econometrics #sampling-distribution Question about the conceptual sampling distribution for one ti…

### Bounty: 50

In this paper, Do Political Protests Matter? Evidence from the Tea Party Movement, the authors use rainfall on the day of the Tea Party protests as a source of plausibly exogenous variation in rally attendance, i.e. as an instrumental variable.

I have a conceptual question about this: in a scenario like this, what exactly do we think of as the sampling distribution? The way I see it, there are two potential ways to think about the conceptual sampling distribution:

1. Rainfall fell as it did across the U.S. on the day of the protests. Take that distribution as fixed. Now, given how rainfall actually fell on that day, we can think of resampling and forming a distribution.
This would also have implications, for example that checking whether rainfall on that day is uncorrelated with observables provides strong evidence for the identifying assumptions.
2. Rainfall is a part of the underlying DGP we are modeling with an equation like $$y = \beta\,\text{Rainfall} + \epsilon$$. So rainfall on that day is not taken as fixed; the idea would be that if we hypothetically went back in time and let the day play out again and again, rain would fall in different ways each time, and this would generate sampling variability. In this case, what matters is that the geoclimatic determinants of rainfall form a process that ‘assigns’ rain, and in each iteration/resampling (i.e. going back and starting the day over again) the county assignment of rainfall would be different. If this is the case, then looking at rainfall across time would be important to show that the process generating rain doesn’t systematically correlate with determinants of $$y$$.

I hope those two ideas make sense, mainly so that someone can correct my logic or point me in the right direction for thinking about these types of things. Is one of the two the ‘correct’ way of thinking about the sampling distribution?


## #StackBounty: #regression #sampling #error #consistency #identification Basic Questions about regression formula, sampling variability,…

### Bounty: 50

Let’s say I run the simple regression $$y_i = \beta_0 + \beta_1x_i + \epsilon_i$$. Assume $$cov(\epsilon,x)=0$$.

This yields the formula people write in terms of covariances for the slope parameter:

$$\hat{\beta_1} = \frac{\sum(x_i-\bar{x})y_i}{\sum(x_i-\bar{x})^2}$$

and then plugging in the true assumed DGP for $$y$$, we get:

$$= \beta_1 + \frac{\sum(x_i-\bar{x})\epsilon_i}{\sum(x_i-\bar{x})^2}$$

With this, I have a few questions.

1. Is this now a statement not about the population, but about the ‘draw’ of $$\epsilon_i$$’s we happened to get in this sample? So is the numerator of the second term the $$\textit{sample}$$ covariance between $$\epsilon$$ and $$x$$? If true, can I think of each random sample as a given draw of $$\epsilon_i$$’s, and that draw is what drives the sampling variability of the estimator?

2. Taking probability limits, covariance $$=0$$ seems to be sufficient for consistency of the estimator. However, is zero covariance alone not sufficient for unbiasedness? Is mean independence of $$\epsilon$$ and $$x$$ needed for finite-sample properties?

3. And also a question about thinking about ‘identification’. If I think of the model above as the causal model, and I can say my OLS is consistent, does that mean I have ‘identified’ the true $$\beta_1$$? So can I think of the model as not being identified if $$cov(\epsilon,x) \neq 0$$, which would say that $$\hat{\beta_1}$$ converges in probability to the true $$\beta_1$$ plus some other term, so that I fail to isolate the underlying parameter?
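The decomposition of the slope estimate into the true slope plus a term driven by the sampled errors can be checked numerically. The sketch below simulates a few samples from an assumed DGP with standard normal $$x$$ and $$\epsilon$$ (an illustrative choice, not part of the question) and prints the OLS slope next to $$\beta_1$$ plus the error term. The function name is hypothetical.

```python
import random


def ols_slope_decomposition(n, beta0, beta1, seed=0):
    """Return (OLS slope, sum((x - xbar) * eps) / sum((x - xbar)^2))."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    noise_term = sum((xi - x_bar) * ei for xi, ei in zip(x, eps)) / sxx
    return beta1_hat, noise_term


# Each seed is a different "draw" of the errors, giving a different slope.
for seed in range(3):
    b_hat, noise = ols_slope_decomposition(200, 1.0, 2.0, seed)
    print(seed, round(b_hat, 4), round(2.0 + noise, 4))
```

The two printed columns agree up to floating-point error for every seed, while the value itself moves from draw to draw.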


## #StackBounty: #sampling #optimization #monte-carlo #minimum #numerical-integration Minimize asymptotic variance of finitely many estimates

### Bounty: 50

Let

• $$(E,\mathcal E,\lambda)$$ be a $$\sigma$$-finite measure space;
• $$f:E\to[0,\infty)^3$$ be a bounded Bochner integrable function on $$(E,\mathcal E,\lambda)$$ and $$p:=\alpha_1f_1+\alpha_2f_2+\alpha_3f_3$$ for some $$\alpha_1,\alpha_2,\alpha_3\ge0$$ with $$\alpha_1+\alpha_2+\alpha_3=1$$ and $$c:=\int p\:{\rm d}\lambda\in(0,\infty)\tag1$$
• $$\mu:=p\lambda$$
• $$I\subseteq$$ be a finite nonempty set
• $$r_i$$ be a probability density on $$(E,\mathcal E,\lambda)$$ with $$E_1:=\{p>0\}\subseteq\{r_i>0\}\tag2$$ for $$i\in I$$
• $$w_i:E\to[0,1]$$ be $$\mathcal E$$-measurable for $$i\in I$$ with $$\{p>0\}\subseteq\left\{\sum_{i\in I}w_i=1\right\}\tag3$$

I’m running the Metropolis-Hastings algorithm with target distribution $$\mu$$ and use the generated chain to estimate $$\lambda g$$ for $$\mathcal E$$-measurable $$g:E\to[0,\infty)^3$$ with $$\{p=0\}\subseteq\{g=0\}$$ and $$\lambda|g|<\infty$$. The asymptotic variance of my estimate is given by $$\sigma^2(g)=\sum_{i\in I}(\mu w_i)\int_{E_1}\frac{\left|g-\frac pc\lambda g\right|^2}{r_i}\:{\rm d}\lambda.\tag4$$ Given a finite system $$\mathcal H\subseteq\mathcal E$$ with $$\lambda(H)\in(0,\infty)$$ for all $$H\in\mathcal H$$, I want to choose the $$w_i$$ such that $$\max_{H\in\mathcal H}\sigma^2(1_Hf)\tag5$$ is as small as possible.

It’s easy to see that, for fixed $$g$$, $$(4)$$ is minimized by choosing $$w_i\equiv 1$$ for the $$i\in I$$ which minimizes $$\int_{E_1}\frac{\left|g-\frac pc\lambda g\right|^2}{r_i}\:{\rm d}\lambda$$ and $$w_j\equiv 0$$ for all other $$j\in I\setminus\{i\}$$.

So, my idea is to bound $$(5)$$ by something which doesn’t depend on $$\mathcal H$$ anymore. Maybe we can bound it by the “variance” of $$f$$ using the fact that $$\operatorname E[X]$$ minimizes $$x\mapsto\operatorname E\left[\left|x-X\right|^2\right]$$.

I’m not sure if this is sensible, but maybe it’s easier to consider $$\sigma^2(g_H)$$ instead of $$\sigma^2(1_Hf)$$, where $$g_H:=1_H\left(f-\frac{\lambda(1_Hf)}{\lambda(H)}\right)$$. By the aforementioned idea we obtain $$\sigma^2(g_H)\le\int_{E_1}\frac{|f|^2}{r_i}\:{\rm d}\lambda-\frac1{\theta_i}\left|\int_{E_1}\frac f{r_i}\:{\rm d}\lambda\right|^2,\tag6$$ where $$\theta_i=\int_{E_1}\frac1{r_i}\:{\rm d}\lambda$$. Now we could estimate the right-hand side of $$(6)$$ using Monte Carlo integration and compute the index $$i$$ for which this estimate is minimal. Is this a good idea or can we do better?


## #StackBounty: #sampling #optimization #monte-carlo #minimum #numerical-integration Compute which of a finite number of integrals is min…

### Bounty: 50

Let

• $$(E,\mathcal E,\lambda)$$ be a $$\sigma$$-finite measure space;
• $$f:E\to[0,\infty)^3$$ be a bounded Bochner integrable function on $$(E,\mathcal E,\lambda)$$ and $$p:=\alpha_1f_1+\alpha_2f_2+\alpha_3f_3$$ for some $$\alpha_1,\alpha_2,\alpha_3\ge0$$ with $$\alpha_1+\alpha_2+\alpha_3=1$$ and $$\int p\:{\rm d}\lambda\in(0,\infty)\tag1$$
• $$I$$ be a finite nonempty set
• $$r:(I\times E)\times E$$ be the density of a Markov kernel with source $$(I\times E,2^I\otimes\mathcal E)$$ and target $$(E,\mathcal E)$$ with $$E_1:=\{p>0\}\subseteq\{r((i,x),\;\cdot\;)>0\}\;\;\;\text{for all }i\in I\tag2$$

Fix $$x\in E$$. I want to find the index $$i\in I$$ minimizing $$\sigma_i:=\lambda_i\left|f-\frac{\lambda_if}{\lambda_i(E_1)}\right|^2=\lambda_i(E_1)\operatorname E_i\left[\left|f-\operatorname E_i[f]\right|^2\right],$$ where $$\lambda_i:=\frac1{r((i,x),\;\cdot\;)}\left.\lambda\right|_{E_1}$$ and $$\operatorname E_i$$ is the expectation wrt $$\operatorname P_i:=\frac{\lambda_i}{\lambda_i(E_1)}$$ for $$i\in I$$.

I’m not interested in the value $$sigma_i$$ itself.

Currently I’m estimating each $$sigma_i$$ using Monte Carlo integration and then compute the minimum. However, this is extremely slow.

I’m not familiar with this kind of problem and hence this might be nonsensical, but note that if $$Y_i$$ is an $$(E,\mathcal E)$$-valued random variable with $$Y_i\sim r((i,x),\;\cdot\;)\lambda$$, then $$\begin{equation}\begin{split}&\sigma_i=\operatorname E\left[1_{E_1}(Y_i)\frac{|f(Y_i)|^2}{\left|r((i,x),Y_i)\right|^2}\right]\\&\;\;\;\;\;\;\;\;\;\;\;\;-\left(\operatorname E\left[1_{E_1}(Y_i)\frac1{\left|r((i,x),Y_i)\right|^2}\right]\right)^{-1}\left|\operatorname E\left[1_{E_1}(Y_i)\frac{f(Y_i)}{\left|r((i,x),Y_i)\right|^2}\right]\right|^2\end{split}\tag3\end{equation}$$ for all $$i\in I$$. So, we might approximate $$\sigma_i$$ by an independent identically distributed process $$\left(Y_i^{(n)}\right)_{n\in\mathbb N}$$ with $$Y_i^{(1)}\sim r((i,x),\;\cdot\;)\lambda$$ via $$A_i^{(n)}-\frac1{B_i^{(n)}}\left|C_i^{(n)}\right|^2\xrightarrow{n\to\infty}\sigma_i\;\;\;\text{almost surely},\tag4$$ where \begin{align}A_i^{(n)}:=\frac1n\sum_{k=1}^n\left|\frac{1_{E_1}f}{r((i,x),\;\cdot\;)}\right|^2\left(Y_i^{(k)}\right),\\B_i^{(n)}:=\frac1n\sum_{k=1}^n\frac{1_{E_1}}{\left|r((i,x),\;\cdot\;)\right|^2}\left(Y_i^{(k)}\right),\\C_i^{(n)}:=\frac1n\sum_{k=1}^n\frac{1_{E_1}f}{\left|r((i,x),\;\cdot\;)\right|^2}\left(Y_i^{(k)}\right)\end{align} for $$n\in\mathbb N$$, for all $$i\in I$$.

Currently I’m using $$(4)$$ to estimate $$\sigma_i$$ and then compute the minimum among the estimates. But this is extremely slow and might even be erroneous.


## #StackBounty: #sampling #measurement-error Balancing measurement error with number of samples

### Bounty: 50

Suppose I am doing a physical experiment and would like to measure the output (a random variable). Inherently, I introduce measurement errors when sampling the random variable. There are also sampling errors, due to sampling only a finite number of realizations of my random variable. Is there literature on how to balance these two types of errors?

It is not hard to imagine a scenario where I can take more samples if I use a less precise measurement device, and so a natural question is how to decide what precision to use.

For instance, suppose that by rounding my measurements to the nearest centimeter rather than millimeter, I can increase the number of samples I can take by a factor of 5. Which precision should I use?

I am aware of Sheppard’s corrections, but I don’t think those are general enough for all cases; e.g. if my data is discrete. Moreover, even in the continuous case, Sheppard’s corrections say that the measurement errors do not affect the mean. This is reasonable if you’re using a relatively fine measurement system, but is clearly not true if your measurement precision is very low.

To clarify, I am considering the case where the rounding error is a deterministic function of my original random variable; i.e. assume that I sample in infinite precision, and then round to my measurement system (say the integers).
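The trade-off can be explored by simulation. The sketch below compares the mean squared error of the sample mean for two designs: $$n$$ draws rounded to a fine unit versus $$5n$$ draws rounded to a unit ten times coarser, mirroring the millimeter versus centimeter example. The normal distribution, its parameters, and the function name are assumptions chosen only for illustration.

```python
import random


def mse_of_mean(mu, sigma, n, unit, trials, seed=0):
    """Monte Carlo MSE of the mean of n draws rounded to a multiple of unit."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Sample in "infinite" precision, then round deterministically.
        draws = (round(rng.gauss(mu, sigma) / unit) * unit for _ in range(n))
        estimate = sum(draws) / n
        total += (estimate - mu) ** 2
    return total / trials


# Fine design: 20 samples at unit 0.1. Coarse design: 100 samples at unit 1.
mse_fine = mse_of_mean(10.3, 2.0, n=20, unit=0.1, trials=1000, seed=1)
mse_coarse = mse_of_mean(10.3, 2.0, n=100, unit=1.0, trials=1000, seed=2)
print(f"fine: {mse_fine:.4f}  coarse: {mse_coarse:.4f}")
```

In this configuration the spread of the variable is large relative to the rounding unit, so the extra samples dominate. When the spread is small relative to the rounding unit, rounding introduces a bias that does not shrink with more samples, and the comparison can reverse.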


## #StackBounty: #r #python #sampling #mcmc #kernel-smoothing Sampling from dataset according to distribution obtained from another dataset

### Bounty: 50

Suppose we have a dataset $$A$$ with several categorical and numerical features:
$$A_{cat_1}$$, $$A_{cat_2}$$, $$\ldots\;$$ $$A_{num_1}$$, $$A_{num_2}$$, $$\ldots\;$$

We also have another dataset $$B$$ with the same features, but probably with fewer categories (i.e. unique values) in the features.

We want to sample from dataset $$B$$ according to the joint distribution of $$A$$. How can one do it?

I was thinking in the following direction:

1. We can LabelEncode the categorical features. After that we can use Kernel Density Estimation on both the numerical and the encoded categorical features. Is this the right way to estimate the distribution of $$A$$, or is there a more correct procedure for estimating a distribution in the presence of categorical variables?
2. After we obtain the KDE of $$A$$, we need to sample from $$B$$ according to that distribution. Could you describe (or maybe provide some code for) how one can do it?

I admit this path may not be the best solution, so I am also open to better suggestions or any sources.
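One alternative to a KDE on label-encoded categories is weighted resampling: give each row of $$B$$ a weight equal to the ratio of its cell's share in $$A$$ to its share in $$B$$, then draw rows of $$B$$ with those weights. The sketch below shows this for categorical cells only. Numerical features would first need to be binned (or handled by a density-ratio estimate), and cells that occur in $$A$$ but not in $$B$$ cannot be reproduced. The function name and toy data are hypothetical.

```python
import random
from collections import Counter


def resample_b_like_a(a_rows, b_rows, key, k, seed=0):
    """Draw k rows from b_rows so that key(row) follows A's distribution."""
    rng = random.Random(seed)
    a_counts = Counter(key(r) for r in a_rows)
    b_counts = Counter(key(r) for r in b_rows)
    # Weight = share of the cell in A divided by its share in B.
    # Cells absent from A get weight 0 (Counter returns 0 for missing keys).
    weights = [
        (a_counts[key(r)] / len(a_rows)) / (b_counts[key(r)] / len(b_rows))
        for r in b_rows
    ]
    return rng.choices(b_rows, weights=weights, k=k)


# Toy data: A is 80% "x" and 20% "y"; B is split 50/50.
a_rows = [{"cat": "x"}] * 80 + [{"cat": "y"}] * 20
b_rows = [{"cat": "x"}] * 50 + [{"cat": "y"}] * 50
resampled = resample_b_like_a(a_rows, b_rows, key=lambda r: r["cat"], k=10000)
share_x = sum(r["cat"] == "x" for r in resampled) / len(resampled)
print(f"share of 'x' in resampled B: {share_x:.3f}")
```

Sampling here is with replacement, so rows of $$B$$ in cells that are rare in $$B$$ but common in $$A$$ will be repeated often.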
