*Bounty: 200*

Suppose I have a large sequence of size $M$ which contains $K$ unique items, where item $k$ occurs with unknown probability $\pi_k$. I can choose to measure its quality, $x_k$, which is constant for a given item $k$.

My goal is to estimate the average quality, i.e., the true weighted average, along with a confidence interval around it:

$$\sum_{k=1}^K \pi_k x_k$$

One plan is to draw a uniform random sample $J$ of items from this sequence and average the quality over the sampled items (since item $k$ is sampled with probability $\pi_k$):

$$\frac{1}{|J|} \sum_{j \in J} x_j$$

and estimate the variance of the estimator using the usual CLT-based approach.

Suppose, however, that it is also easy to compute the total number of times each item occurs, $(n_1, \ldots, n_K)$. **Can I use this information to produce estimates with smaller confidence intervals?**

Not to bias the potential answers, but I feel like it should be possible, since I will have more information about $\pi$ and should therefore be able to apply some sort of variance reduction technique.
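For concreteness, here is a minimal sketch of what the counts provide, assuming the sequence fits in memory as a NumPy array. The names `seq` and `pi_hat`, the size `M = 1000`, and the five item ids are made-up stand-ins, not real data. If $\pi_k$ is taken to be the frequency of item $k$ in the sequence itself, then $n_k / M$ gives it exactly rather than as an estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real sequence: M draws from 5 item ids.
M = 1000
seq = rng.integers(0, 5, size=M)

# n_k: number of occurrences of each unique item in the sequence.
items, counts = np.unique(seq, return_counts=True)

# Empirical frequencies n_k / M, which sum to 1 by construction.
pi_hat = counts / M

print(dict(zip(items.tolist(), pi_hat.tolist())))
```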

Also, to work through a specific example, I’ve been using the following distribution, which mimics my actual use case.

```
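import numpy as np

# Suppose we have K unique items
K = 10000
# Unnormalized frequencies: item i occurs in proportion to 1/(i + 100)
freq = np.array([K / (i + 100) for i in range(K)])
# Normalize so the probabilities sum to 1
true_pi = freq / freq.sum()
# Quality decreases linearly from 0.8 toward 0.4 as i grows
true_x = np.array([0.8 - 0.4 * i / K for i in range(K)])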
```
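The baseline plan described earlier (sample mean plus a CLT-based interval) can be sketched on this distribution as follows. This is only an illustration under stated assumptions: the target is taken to be $\sum_k \pi_k x_k$, items are drawn i.i.d. with probabilities $\pi_k$ (a reasonable approximation to uniform sampling when $M$ is large), and the helper name `sample_mean_ci`, the sample size of 1000, the seed, and the normal quantile 1.96 are arbitrary choices made here.

```python
import numpy as np

# Same example distribution as above, repeated so the sketch is self-contained.
K = 10000
freq = K / (np.arange(K) + 100)
true_pi = freq / freq.sum()
true_x = 0.8 - 0.4 * np.arange(K) / K


def sample_mean_ci(pi, x, n_samples, rng, z=1.96):
    """Draw items i.i.d. with probabilities pi; return (estimate, lower, upper)."""
    idx = rng.choice(len(pi), size=n_samples, p=pi)
    xs = x[idx]
    est = xs.mean()
    # Standard error of the mean from the sample standard deviation.
    se = xs.std(ddof=1) / np.sqrt(n_samples)
    return est, est - z * se, est + z * se


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
target = float(np.sum(true_pi * true_x))
est, lo, hi = sample_mean_ci(true_pi, true_x, n_samples=1000, rng=rng)
print(f"target={target:.4f} estimate={est:.4f} 95% CI=({lo:.4f}, {hi:.4f})")
```

The width of this interval is the yardstick: any approach that uses the counts $(n_1, \ldots, n_K)$ would need to produce a narrower interval at the same number of quality measurements.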