# StackBounty: #streaming-algorithms #streaming Estimating the cardinality of multisets given some sets of different sizes

Bounty: 50

The count-distinct problem is: given a stream of elements $x_1, \dots, x_n$ which contains duplicates, find the number of distinct elements using a small amount of space.

I also want the following property: given two or more streams $S_1, \dots, S_k$, we see the streams one after the other. After all the streams have been seen, we are given two indices $i$ and $j$ at random and asked to return the number of distinct elements in $S_i \setminus S_j$. We would like to solve this problem using as little memory as possible; notably, we cannot afford to remember all the elements of all the streams while waiting for the $S_i \setminus S_j$ query.
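For concreteness, the exact but memory-hungry baseline ruled out above would look something like the following (a minimal sketch; `observe` and `query` are illustrative names, not part of the question):

```python
from collections import defaultdict

# Store every distinct element of every stream: space grows with the total
# number of distinct elements across all streams, which is what we cannot afford.
seen = defaultdict(set)

def observe(stream_id, x):
    seen[stream_id].add(x)

def query(i, j):
    return len(seen[i] - seen[j])   # exact |S_i \ S_j|
```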

This problem can be solved using the HyperLogLog algorithm: keep one sketch per stream, merge the sketches of $S_i$ and $S_j$ to estimate $|S_i \cup S_j|$, and apply inclusion-exclusion. However, this really only works when the $|S_i|$ are all about the same size. The reason is that the relative error of HyperLogLog is $O(1/\sqrt{m})$, where $m$ is the number of registers used to store the hash values. Suppose we care about small space, $m = \log(n)$ where $n = \max_i |S_i|$. Then suppose the query asks for $S_i$ and $S_j$ where $|S_i| < (1/\sqrt{\log n})\,|S_j|$. In this case, we would not achieve a good error bound on $|S_i \setminus S_j|$: by the inclusion-exclusion principle, $|S_i \setminus S_j| = |S_i \cup S_j| - |S_j|$, and the additive error on the estimate of $|S_j|$, which is of order $|S_j|/\sqrt{m}$, can exceed $|S_i|$ and therefore mask the number of elements in $S_i$.
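To make the failure mode concrete, here is a minimal HyperLogLog sketch in Python (the class name `HLL`, the parameter `p`, and the demo stream sizes are my own choices; the hash is a 64-bit value taken from SHA-1, and the bias constant is the standard one for $m \ge 128$):

```python
import hashlib
import math

class HLL:
    def __init__(self, p=10):
        self.p = p              # m = 2**p registers
        self.m = 1 << p
        self.reg = [0] * self.m

    def add(self, item):
        # 64-bit hash; the first p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def merge(self, other):
        # Register-wise max: the merged sketch is exactly the sketch of the union.
        out = HLL(self.p)
        out.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
        return out

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for m >= 128
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * self.m and zeros:              # small-range correction
            est = self.m * math.log(self.m / zeros)
        return est

# Demo of the masking effect: |S_i| = 100, |S_j| = 100000, true |S_i \ S_j| = 50.
sk_i, sk_j = HLL(), HLL()
for x in range(100):
    sk_i.add(x)
for x in range(50, 100050):
    sk_j.add(x)
diff_est = sk_i.merge(sk_j).estimate() - sk_j.estimate()   # inclusion-exclusion
print(diff_est)   # typically off by thousands, dwarfing the true answer of 50
```

With $p = 10$, i.e. $m = 1024$ registers, each estimate carries a relative standard error of roughly $1.04/\sqrt{m} \approx 3\%$; on $|S_j| \approx 10^5$ that is an additive error in the thousands, so `diff_est` can even come out negative.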

Are there any other existing data structures that can handle this when the sizes of the $S_i$ differ by more than a $(1/\sqrt{m})$-factor? Please do comment if any part of this question is unclear.

