# #StackBounty: #machine-learning #classification #data-mining Predicting column use in a table

### Bounty: 50

I have a set of tables $\mathcal{T} = \{T_1, \ldots, T_n\}$, where each $T_i$ is a collection of named columns $\{c_0, \ldots, c_{j_i}\}$. In addition, I have a large sequence of observations $\mathcal{D}$ of the form $(T_i, c_k)$, indicating that, given access to table $T_i$, a user decided to use column $c_k$ for a particular task (the task itself is not relevant to the problem formulation). Given a new table $T_j \notin \mathcal{T}$, I'd like to rank the columns of $T_j$ by the likelihood that a user would pick each column for the same task.

My first intuition was to expand each observation $(T_i, c_k) \in \mathcal{D}$ into $\{(c_k, \text{True})\} \cup \{(c_j, \text{False}) \mid c_j \in T_i \land j \neq k\}$ and treat this as a binary classification problem, using the predicted probability of the positive class as the ranking score. My concern is that this ignores the relationships between columns within a given table.
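The expansion step above can be sketched as follows; this is a minimal illustration assuming tables are dicts from column name to values, and `expand_observations` is a placeholder name:

```python
# Hypothetical sketch: expand each (table, chosen_column) observation into
# one (column, label) example per column, where the label marks whether
# that column was the one the user picked.
def expand_observations(observations):
    """observations: iterable of (table, chosen_column) pairs,
    where table is a dict mapping column name -> list of values."""
    examples = []
    for table, chosen in observations:
        for name in table:
            examples.append((name, name == chosen))
    return examples

obs = [({"age": [30, 40], "id": [10, 11]}, "age")]
print(expand_observations(obs))  # → [('age', True), ('id', False)]
```

In practice the column name would be replaced by a feature vector for the column, and the labeled rows fed to any probabilistic classifier.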

I also thought there might be a reasonable way to summarize $T_i$, call it $\phi$, and to then recast each observation as $(\phi(T_i), f(c_k))$, where $f$ is some feature function over the column.
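As a concrete strawman of what $\phi$ and $f$ might look like, here is a minimal sketch; the specific summaries chosen (column count, average length, mean, standard deviation) are arbitrary placeholders, not a recommendation:

```python
import statistics

def phi(table):
    """Crude table summary: (column count, average column length).
    table: dict mapping column name -> list of numeric values."""
    lengths = [len(values) for values in table.values()]
    return (len(table), statistics.mean(lengths))

def f(values):
    """Crude column features: (mean, population standard deviation)."""
    return (statistics.mean(values), statistics.pstdev(values))

table = {"age": [30, 40, 50], "id": [1, 2, 3]}
print(phi(table))       # → (2, 3)
print(f(table["age"]))  # mean 40, pstdev ≈ 8.165
```

Each observation $(T_i, c_k)$ would then become the concatenated feature vector $\phi(T_i) \oplus f(c_k)$ with a binary label, as in the expansion above.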

I suspect this is a problem that people have tackled before, but I cannot seem to find good information. Any suggestions would be greatly appreciated.

[Update]

Here’s an idea I’ve been tossing around; I was hoping to get input from more knowledgeable people. Let’s assume users pick $c_j \in T_i$ as a function of how “interesting” the column is. We can estimate the distribution that generated $c_j$; call this $\hat{X}_j$. If we assume a normal distribution is “uninteresting”, then define $\text{interest}(c_j) = \delta(\hat{X}_j, \text{Normal})$, where $\delta$ is some distance between distributions (e.g. the Bhattacharyya distance, https://en.wikipedia.org/wiki/Bhattacharyya_distance). The interest level of a table is $\text{interest}(T_i) = \text{op}(\{\text{interest}(c_j) \mid c_j \in T_i\})$, where $\text{op}$ is an aggregator (e.g. the average). Now I expand the original observations $(T_i, c_k) \in \mathcal{D}$ into triplets $(\text{interest}(T_i), \text{interest}(c_j), c_j = c_k)$ and treat these as a classification problem. Thoughts?
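One way to make this concrete is to discretize the empirical column distribution into a histogram, discretize a normal fitted to the same data over the same bins, and take the Bhattacharyya distance between the two. This is a sketch, not a recommendation: the bin count, the epsilon guard, and the handling of constant columns are all arbitrary choices:

```python
import math
import numpy as np

def _normal_cdf(x, mu, sigma):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interest(values, bins=20):
    """Bhattacharyya distance between the column's empirical histogram
    and a normal distribution fitted to the same data."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    p = counts / counts.sum()                 # empirical bin masses
    mu, sigma = values.mean(), values.std()
    if sigma == 0.0:                          # constant column: treated as
        return 0.0                            # "uninteresting" (arbitrary choice)
    q = np.diff([_normal_cdf(e, mu, sigma) for e in edges])
    q = q / q.sum()                           # fitted-normal bin masses
    bc = np.sqrt(p * q).sum()                 # Bhattacharyya coefficient
    return -math.log(max(bc, 1e-12))          # Bhattacharyya distance

def table_interest(table, op=np.mean):
    """interest(T_i) = op({interest(c_j) | c_j in T_i}), op = mean by default."""
    return op([interest(col) for col in table.values()])
```

Ranking the columns of a new table then amounts to sorting by `interest(c_j)`, or feeding the `(table_interest, interest)` pairs to a classifier as described in the triplet expansion.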

