#StackBounty: #categorical-data #categorical-encoding #sparse #subset Compact encoding (vectorization) of unbounded sets

Bounty: 50

Question

I have a set of sets. Each set is unbounded.

I would like to find a methodology to encode (vectorize) each subset.

I am specifically interested in memory-efficient solutions.

Example

Let `X` be the set of sets, and let `A` and `B` be the sets it contains.

$$X = \{A, B\}$$
$$A = \{1, 2, 3\}$$
$$B = \{2, 3, 4\}$$

A simple approach would be one-hot (strictly speaking, multi-hot) encoding over the union of all elements:

$$\vec{A} = [1, 1, 1, 0]$$
$$\vec{B} = [0, 1, 1, 1]$$
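
For concreteness, here is a minimal Python sketch of the encoding above: one 0/1 slot per distinct element, ordered by the sorted union of all sets. The function name `multi_hot` and the dictionary layout are my own choices for illustration, not part of any library.

```python
def multi_hot(subsets):
    """Encode each named set as a 0/1 vector over the sorted union of all elements."""
    # Vocabulary: every distinct element that appears in any set, in sorted order.
    vocabulary = sorted(set().union(*subsets.values()))
    index = {element: i for i, element in enumerate(vocabulary)}

    vectors = {}
    for name, members in subsets.items():
        vec = [0] * len(vocabulary)
        for element in members:
            vec[index[element]] = 1  # mark the slot belonging to this element
        vectors[name] = vec
    return vocabulary, vectors


vocabulary, vectors = multi_hot({"A": {1, 2, 3}, "B": {2, 3, 4}})
print(vocabulary)    # [1, 2, 3, 4]
print(vectors["A"])  # [1, 1, 1, 0]
print(vectors["B"])  # [0, 1, 1, 1]
```

The vector length equals the number of distinct elements across all sets, so it grows with every new element that appears.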

Issue

My issue is that when the sets are large, one-hot encoding becomes impractical:
each set turns into a sparse vector over 10 to 30 thousand unique values.

Any suggestions on encoding the inputs into a denser vector would be appreciated.
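
To make the kind of encoding I am after more concrete, here is a rough sketch of one candidate I am aware of, the hashing trick: each element is hashed into one of `dim` buckets, so the vector length stays fixed no matter how many distinct elements exist. This is only an illustration under my own assumptions: the name `hashed_encoding`, the choice of MD5 over `repr(element)`, and the default `dim=8` are all arbitrary, and distinct elements can collide in the same bucket, so the encoding is lossy.

```python
import hashlib


def hashed_encoding(members, dim=8):
    """Map a set to a fixed-length 0/1 vector by hashing each element to a bucket."""
    vec = [0] * dim
    for element in members:
        # A stable hash (unlike the built-in hash() for strings, which is salted per process).
        digest = hashlib.md5(repr(element).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] = 1  # collisions simply land on the same slot
    return vec


vec_a = hashed_encoding({1, 2, 3})
vec_b = hashed_encoding({2, 3, 4})
print(len(vec_a), len(vec_b))  # both equal dim, regardless of vocabulary size
```

A smaller `dim` saves memory but increases collisions, so the trade-off would have to be tuned to the data.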

