# #StackBounty: #streaming-algorithms #streaming Estimating the cardinality of multisets given some sets of different sizes

### Bounty: 50

The count-distinct problem is: given a stream of elements $$x_1, \dots, x_n$$ which may contain duplicates, find the number of distinct elements using a small amount of space.

I also want the following property: we are given two or more streams $$S_1, \dots, S_k$$, which we see one after the other. After all streams have been observed, we are given two stream indices $$i$$ and $$j$$ at random and asked to return the number of distinct elements in $$S_i \setminus S_j$$. We would like to solve this problem using as little memory as possible; in particular, we cannot afford to store all the elements of all the streams before the $$S_i \setminus S_j$$ query arrives.
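For concreteness, here is the exact but memory-hungry baseline this question is trying to avoid (the function and stream names are illustrative, not part of the question): store every stream in full, then answer the difference query directly. This uses memory proportional to the total number of distinct elements across all streams.

```python
# Exact baseline: store each stream as a set, then answer queries directly.
# Memory cost is Theta(sum_i |S_i|), which is exactly what we cannot afford.
streams = {}

def observe(stream_id, element):
    """Record one element of the given stream as it goes by."""
    streams.setdefault(stream_id, set()).add(element)

def difference_query(i, j):
    """Number of distinct elements in S_i \\ S_j (exact answer)."""
    return len(streams[i] - streams[j])

# Example: S1 sees 1, 2, 3, 2 and S2 sees 2, 4.
for x in [1, 2, 3, 2]:
    observe("S1", x)
for x in [2, 4]:
    observe("S2", x)
print(difference_query("S1", "S2"))  # distinct elements of S1 not in S2: {1, 3} -> 2
```

The question asks for a sketch that answers the same query with far less memory than this.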

This problem can be solved using the HyperLogLog algorithm. However, this algorithm really only works well when the $$|S_i|$$ are all about the same size. This is because its relative error is $$O(1/\sqrt{m})$$, where $$m$$ is the number of registers used to store the hash statistics of the HyperLogLog algorithm. Suppose we care about small space, $$m = \log n$$ where $$n = \max_i |S_i|$$. Then suppose the query asks for $$S_i$$ and $$S_j$$ where $$|S_i| < (1/\sqrt{\log n})\,|S_j|$$. In this case, we would not achieve a good error bound on $$|S_i \setminus S_j|$$ (by the inclusion-exclusion principle, $$|S_i \setminus S_j| = |S_i \cup S_j| - |S_j|$$), since the additive error on the estimate of $$|S_j|$$ could exceed, and hence mask, the number of elements in $$S_i$$.
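A minimal sketch of the HyperLogLog approach described above, to make the inclusion-exclusion estimate concrete. This is an illustrative toy implementation (parameter names and the `diff_estimate` helper are mine, not from the question); it keeps one small sketch per stream and merges two sketches to estimate the union:

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog sketch with m = 2^p registers."""
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.reg = [0] * self.m

    def _hash(self, x):
        # 64-bit hash of the element (SHA-256 truncated, for reproducibility).
        h = hashlib.sha256(str(x).encode()).digest()
        return int.from_bytes(h[:8], "big")

    def add(self, x):
        h = self._hash(x)
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def merge(self, other):
        # Register-wise max sketches the union of the two streams.
        out = HLL(self.p)
        out.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
        return out

    def estimate(self):
        # Standard HLL estimator with the alpha_m bias correction,
        # plus linear counting in the small-cardinality regime.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw

def diff_estimate(h_i, h_j):
    """Estimate |S_i \\ S_j| = |S_i u S_j| - |S_j| via inclusion-exclusion."""
    return h_i.merge(h_j).estimate() - h_j.estimate()
```

Note that `diff_estimate` inherits an additive error of roughly $$|S_i \cup S_j|/\sqrt{m}$$: if $$|S_i|$$ is much smaller than $$|S_j|$$, that error swamps the true difference, which is exactly the failure mode described above.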

Are there any other existing data structures that can handle this when the sizes of the $$S_i$$ differ by more than a $$(1/\sqrt{m})$$-factor? Please do comment if any part of this question is unclear.

