#StackBounty: #machine-learning #python #data-mining #text-mining #topic-model Compare two topic modelling sets

Bounty: 50

I have two sets of newspaper articles. I train a topic model on the first newspaper dataset to get the topics for each article.

E.g., first newspaper dataset
article_1 = {'politics': 0.1, 'nature': 0.8, ..., 'sports':0, 'wild-life':1}

Similarly, I train a separate topic model on the second newspaper dataset (from a different distributor) to get the topics for each article.

E.g., second newspaper dataset (from a different distributor)
article_2 = {'people': 0.3, 'animals': 0.7, ...., 'business':0.7, 'sports':0.2}

As the examples show, the topics I get from the two datasets are different, so I manually matched similar topics based on their most frequent words.

I want to identify whether the two newspaper distributors publish the same news each week.

Hence, I am interested in a systematic way of comparing the topics across the two corpora and measuring their similarity. Please help me.
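
To make the notion of similarity concrete, here is a minimal numpy sketch under the assumption that the topics have already been manually matched onto one shared list; the topic names and weights are invented for illustration. It maps each article's topic dictionary onto that list and computes the Hellinger distance between the two distributions:

import numpy as np

# Hypothetical shared topic list after manually matching the two models' topics.
TOPICS = ['politics', 'nature', 'sports', 'wild-life']

def to_vector(article_topics, topics=TOPICS):
    # Map a {topic: weight} dict onto the fixed topic order and renormalize.
    v = np.array([article_topics.get(t, 0.0) for t in topics], dtype=float)
    total = v.sum()
    return v / total if total > 0 else v

def hellinger(p, q):
    # Hellinger distance: 0 for identical distributions, 1 for disjoint ones.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

article_1 = {'politics': 0.1, 'nature': 0.8, 'wild-life': 0.1}  # illustrative
article_2 = {'politics': 0.2, 'nature': 0.6, 'sports': 0.2}     # illustrative

print(hellinger(to_vector(article_1), to_vector(article_2)))

Averaging this distance over the articles published in the same week (or over each week's aggregated topic distribution) would give one possible week-by-week similarity score between the two distributors.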


Get this bounty!!!

#StackBounty: #machine-learning #classification #data-mining Predicting column use in a table

Bounty: 50

I have a set of tables $\mathcal{T} = \{T_1, \ldots, T_n\}$, where each $T_i$ is a collection of named columns $\{c_0, \ldots, c_{j_i}\}$. In addition, I have a large sequence of observations $\mathcal{D}$ of the form $(T_i, c_k)$, indicating that, given access to table $T_i$, a user decided to use column $c_k$ for a particular task (the task itself is not relevant to the problem formulation). Given a new table $T_j \notin \mathcal{T}$, I’d like to rank the columns of $T_j$ based on the likelihood that a user would pick each of them for the same task.

My first intuition was to expand each observation $(T_i, c_k) \in \mathcal{D}$ into $\{(c_k, \text{True})\} \cup \{(c_j, \text{False}) \mid c_j \in T_i \land j \neq k\}$ and treat this as a classification problem, using the predicted probability of the positive class as the ranking metric. My concern is that this ignores the relationships between the columns of a given table. A naive sketch of this idea follows.
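
As a concrete sketch of that first intuition: assuming tables are represented as dicts mapping column names to numeric value lists, and with a made-up per-column featurizer (the featurization is exactly the part I am unsure about), the expansion and ranking could look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression

def column_features(table, column):
    # Hypothetical featurizer; what should actually go in here is part of the question.
    values = np.asarray(table[column], dtype=float)
    return [values.size, values.mean(), values.std()]

def expand(observations):
    # Turn each (table, chosen_column) pair into one labelled row per column.
    X, y = [], []
    for table, chosen in observations:
        for col in table:
            X.append(column_features(table, col))
            y.append(1 if col == chosen else 0)
    return np.array(X), np.array(y)

# observations = [(table, chosen_column_name), ...]
# X, y = expand(observations)
# model = LogisticRegression().fit(X, y)
# Rank a new table's columns by predicted positive-class probability:
# scores = {c: model.predict_proba([column_features(new_table, c)])[0, 1]
#           for c in new_table}
# ranking = sorted(scores, key=scores.get, reverse=True)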

I also thought there might be a reasonable way to summarize $T_i$, call it $\phi$, and then recast each observation as $(\phi(T_i), f(c_k))$, where $f$ is some function over the column.

I suspect this is a problem that people have tackled before, but I cannot seem to find good information. Any suggestions would be greatly appreciated.

[Update]

Here’s an idea I’ve been tossing around, and I was hoping to get input from more knowledgeable people. Let’s assume users pick $c_j \in T_i$ as a function of how “interesting” that column is. We can estimate the distribution that generated $c_j$; call this $\hat{X}_j$. If we assume a normal distribution is “uninteresting”, then define $\text{interest}(c_j) = \delta(\hat{X}_j, \text{Normal})$, where $\delta$ is some distance between distributions (e.g. the Bhattacharyya distance, https://en.wikipedia.org/wiki/Bhattacharyya_distance). The interest level of a table is $\text{interest}(T_i) = \text{op}(\{\text{interest}(c_j) \mid c_j \in T_i\})$, where $\text{op}$ is an aggregator (e.g. the average). Now I expand the original observations $(T_i, c_k) \in \mathcal{D}$ into triplets $(\text{interest}(T_i), \text{interest}(c_j), c_j = c_k)$ and treat these as a classification problem. Thoughts?
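
To make the update concrete, here is a rough numpy/scipy sketch of the interest computation: it compares a column's empirical histogram against a normal fitted to the same values via the Bhattacharyya coefficient (the binning, the clamp, and the aggregation are arbitrary choices of mine):

import numpy as np
from scipy import stats

def interest(column_values, bins=20):
    # Bhattacharyya distance between the column's empirical distribution
    # and a normal fitted to the same values (larger = less "normal").
    values = np.asarray(column_values, dtype=float)
    hist, edges = np.histogram(values, bins=bins, density=True)
    p = hist * np.diff(edges)                       # empirical bin probabilities
    q = np.diff(stats.norm.cdf(edges, loc=values.mean(), scale=values.std()))
    bc = np.sum(np.sqrt(p * q))                     # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))                  # clamp to avoid log(0)

def table_interest(table, op=np.mean):
    # Aggregate column-level interest into a table-level score.
    return op([interest(table[c]) for c in table])

# Each observation (T_i, c_k) then expands into one row per column c_j of T_i:
# features = (table_interest(T_i), interest(T_i[c_j])), label = (c_j == c_k)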


Get this bounty!!!

#StackBounty: #data-mining How to treat incomplete variable values

Bounty: 50

I’m trying to analyze some fairly sparse data on a recurrent medical symptom, and I don’t know what to do with two entries where my data is incomplete.

My overall goal is admittedly a bit vague: to find a pattern that, with the help of doctors, might point to a cause. The symptom is not very serious, but it is annoying. Assume full access to all medical records.

I have data going back three years recording on which days the symptom occurred and on which it did not. However, for two of the events, I only know that they happened “that month”.

Example:

2015,4,1,0,
2015,4,2,0,
2015,4,3,0,
2015,4,4,1,comment
2015,4,5,0,
...

(the columns are year, month, day, an indicator that is 1 if the symptom occurred and 0 otherwise, and a comment)

My two incomplete entries look like:

2015,5,,1,symptom occurred twice this month
2015,5,,1,symptom occurred twice this month

Therefore, if I am going to analyze the data using logistic regression or other methods, or even just by looking at graphs, these two entries are a problem because:

  1. I know the symptom occurred twice in a certain month;
  2. I do not know on which days it occurred, so if I guess, randomize the days, or use an average value, I am concerned I will falsify the data.

How should I treat these two missing “day” values knowing that I otherwise have a complete dataset going back three years?
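
For context, a minimal pandas sketch of one way to load the file so that the two month-only rows stay explicitly flagged instead of being guessed (the file name and column names are mine):

import pandas as pd

# Columns as described above; the file has no header row.
cols = ['year', 'month', 'day', 'symptom', 'comment']
df = pd.read_csv('symptoms.csv', header=None, names=cols)

# The two month-only entries have an empty day field, which becomes NaN.
incomplete = df[df['day'].isna()]
complete = df.dropna(subset=['day'])
print(len(incomplete), 'rows with only month-level information')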


Get this bounty!!!

#StackBounty: #data-mining #dataset #bigdata Yelp Dataset Archives?

Bounty: 50

The Yelp Dataset Challenge (https://www.yelp.com/dataset_challenge) releases data for a handful of cities each year. I’d like to analyze some cities from past years. Is this data archived anywhere?


Get this bounty!!!