# Tag: data mining

## #StackBounty: #machine-learning #data-mining #text-mining tf-idf in text mining

*Bounty: 100*

## #StackBounty: #machine-learning #deep-learning #data-mining #text-mining #rnn How to extract specific information from text using Machi…

*Bounty: 100*

## #StackBounty: #machine-learning #python #data-mining #text-mining #topic-model Compare two topic modelling sets

*Bounty: 50*

## #StackBounty: #machine-learning #classification #data-mining Predicting column use in a table

*Bounty: 50*

## #StackBounty: #data-mining How to treat incomplete variable values

*Bounty: 50*

## #StackBounty: #data-mining #dataset #bigdata Yelp Dataset Archives?

*Bounty: 50*

## #Machine #Learning: The Basics, with Ron Bekkerman

I used Python's sklearn to compute tf-idf features for text analysis, but I ran into a problem: my train_set contains about 78,000 words, yet the tf-idf matrix has only 39,000. What is causing this?
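A likely explanation (assuming `TfidfVectorizer` with default settings) is that sklearn's preprocessing shrinks the vocabulary before tf-idf is computed: lowercasing merges case variants, the default `token_pattern` drops single-character tokens, and any `min_df`/`max_df`/`stop_words` settings prune further. A minimal sketch with a made-up corpus, contrasting the raw whitespace token count with the learned vocabulary size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat sat on the mat", "THE CAT ate a fish", "A dog barked"]

# Raw whitespace tokens, case-sensitive: many "distinct words"
raw_vocab = {tok for d in docs for tok in d.split()}

# Default TfidfVectorizer lowercases and drops tokens shorter than
# 2 characters (token_pattern r"(?u)\b\w\w+\b"), so the learned
# vocabulary is smaller than the raw token set.
vec = TfidfVectorizer()
vec.fit(docs)

print(len(raw_vocab))        # raw distinct tokens: 14
print(len(vec.vocabulary_))  # vocabulary after preprocessing: 9
```

Passing `lowercase=False` and a looser `token_pattern` brings the vocabulary closer to the raw whitespace token count, which makes it easy to check which default is responsible for the gap.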

Suppose I have a text like the one below, which usually has 2-3 sentences and 100-200 characters.

*Johny bought milk of 50 dollars from walmart. Now he has left only 20 dollars.*

I want to extract:

- Person Name: Johny
- Spent: 50 dollars
- Money left: 20 dollars
- Spent where: Walmart

I have gone through a lot of material on recurrent neural networks. I watched the cs231n video on RNNs and understood next-character prediction. In those cases we have a fixed set of 26 characters to use as output classes, and we predict the next character by probability. But this problem seems entirely different, because we don't know the output classes in advance: the output depends on the words and numbers in the text, which can be any arbitrary word or number.

I read on Quora that convolutional neural networks can also extract features from text. Could that approach also solve this particular problem?
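One way to make the output classes fixed again is to frame this as sequence labeling (slot filling): instead of predicting words from an open vocabulary, the model predicts one BIO tag per token from a small fixed label set. The sketch below hand-assigns tags for the example sentence purely to illustrate the target encoding an RNN/CRF tagger would learn to predict; the label names and the `extract` helper are illustrative, not from any library:

```python
# BIO tagging view of the extraction task: the output classes are
# slot labels (a small fixed set), not the words themselves.
LABELS = ["O", "B-PERSON", "B-SPENT", "I-SPENT",
          "B-LEFT", "I-LEFT", "B-PLACE"]

tokens = ["Johny", "bought", "milk", "of", "50", "dollars",
          "from", "walmart", ".", "Now", "he", "has", "left",
          "only", "20", "dollars", "."]
tags   = ["B-PERSON", "O", "O", "O", "B-SPENT", "I-SPENT",
          "O", "B-PLACE", "O", "O", "O", "O", "O",
          "O", "B-LEFT", "I-LEFT", "O"]

def extract(tokens, tags):
    """Collect contiguous B-/I- spans into a slot -> text mapping."""
    slots, current, label = {}, [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                slots[label] = " ".join(current)
            label, current = tag[2:], [tok]
        elif tag.startswith("I-") and label == tag[2:]:
            current.append(tok)
        else:
            if current:
                slots[label] = " ".join(current)
            current, label = [], None
    if current:
        slots[label] = " ".join(current)
    return slots

print(extract(tokens, tags))
# {'PERSON': 'Johny', 'SPENT': '50 dollars',
#  'PLACE': 'walmart', 'LEFT': '20 dollars'}
```

With this encoding, an RNN emits a probability distribution over `LABELS` at each time step, so the "any random word or number" problem disappears: the model points at spans in the input rather than generating values.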

I have two sets of newspaper articles. I train a topic model on the first newspaper dataset to get the topics for each article.

```
E.g., first newspaper dataset
article_1 = {'politics': 0.1, 'nature': 0.8, ..., 'sports':0, 'wild-life':1}
```

Then I train a topic model on my second newspaper dataset (from a different distributor) to get the topics for each of its articles.

```
E.g., second newspaper dataset (from a different distributor)
article_2 = {'people': 0.3, 'animals': 0.7, ...., 'business':0.7, 'sports':0.2}
```

As the examples show, the topics produced for the two datasets differ, so I manually matched similar topics based on their most frequent words.

I want to determine whether the two newspaper distributors publish the same news each week.

Hence, I am interested in whether there is a systematic way to compare topics across two corpora and measure their similarity. Please help me.
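One systematic approach is to compare topics by their word distributions rather than their labels: represent every topic over a shared vocabulary, compute pairwise cosine similarity between topics from the two corpora, and then match them with the Hungarian algorithm instead of by hand. A sketch, where the topic-word weights are invented for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical topic-word distributions from the two corpora
topics_a = {
    "nature":   {"forest": 0.5, "river": 0.3, "animal": 0.2},
    "politics": {"vote": 0.6, "party": 0.4},
}
topics_b = {
    "animals":  {"animal": 0.6, "forest": 0.2, "zoo": 0.2},
    "people":   {"vote": 0.5, "party": 0.3, "citizen": 0.2},
}

# Shared vocabulary across both topic sets
vocab = sorted({w for t in (*topics_a.values(), *topics_b.values()) for w in t})

def to_vec(topic):
    return np.array([topic.get(w, 0.0) for w in vocab])

A = np.array([to_vec(t) for t in topics_a.values()])
B = np.array([to_vec(t) for t in topics_b.values()])

# Cosine similarity between every pair of topics
sim = (A @ B.T) / (np.linalg.norm(A, axis=1)[:, None]
                   * np.linalg.norm(B, axis=1)[None, :])

# Hungarian matching: maximise total similarity across topic pairs
rows, cols = linear_sum_assignment(-sim)
names_a, names_b = list(topics_a), list(topics_b)
for i, j in zip(rows, cols):
    print(f"{names_a[i]} <-> {names_b[j]}  sim={sim[i, j]:.2f}")
```

The per-pair similarities of the matched topics then give a per-week score: if articles from the two distributors consistently land in highly similar matched topics, they are likely publishing the same news.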

I have a set of tables $\mathcal{T} = \{T_1, \ldots, T_n\}$, where each $T_i$ is a collection of named columns $\{c_0, \ldots, c_{j_i}\}$. In addition, I have a large sequence of observations $\mathcal{D}$ of the form $(T_i, c_k)$, indicating that, given access to table $T_i$, a user decided to use column $c_k$ for a particular task (the task itself is not relevant to the problem formulation). Given a new table $T_j \notin \mathcal{T}$, I'd like to rank the columns of $T_j$ by the likelihood that a user would pick each column for the same task.

My first intuition was to expand each observation $(T_i, c_k) \in \mathcal{D}$ into $\{(c_k, \text{True})\} \cup \{(c_j, \text{False}) \mid c_j \in T_i \land j \neq k\}$ and view this as a classification problem; I can then use the predicted probability of the positive class as my ranking metric. My issue is that this seems to ignore the relationships between columns within a given table.

I also thought there might be a reasonable approach based on summarizing $T_i$ with some function $\phi$ and recasting each observation as $(\phi(T_i), f(c_k))$, where $f$ is some function over the column.

I suspect this is a problem that people have tackled before, but I cannot seem to find good information. Any suggestions would be greatly appreciated.

[Update]

Here’s an idea I’ve been tossing around, and I was hoping to get input from more knowledgeable people. Assume users pick $c_j \in T_i$ as a function of how “interesting” that column is. We can estimate the distribution that generated $c_j$; call this $\hat{X}_j$. If we assume a normal distribution is “uninteresting”, then define $\text{interest}(c_j) = \delta(\hat{X}_j, \text{Normal})$, where $\delta$ is some distance metric (e.g. the Bhattacharyya distance, https://en.wikipedia.org/wiki/Bhattacharyya_distance). The interest level of a table is $\text{interest}(T_i) = \text{op}(\{\text{interest}(\hat{X}_j) \mid c_j \in T_i\})$, where $\text{op}$ is an aggregator (e.g. the average). Now I expand the original observations $(T_i, c_k) \in \mathcal{D}$ into triplets $(\text{interest}(T_i), \text{interest}(c_j), c_j == c_k)$ and treat these as a classification problem. Thoughts?
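As a rough sanity check of the idea above, here is a sketch of the per-column interest metric, with the Kolmogorov-Smirnov statistic standing in for the Bhattacharyya distance (only because `scipy.stats.kstest` makes it a one-liner); the table and column names are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical table: one near-normal column, one heavily skewed one
table = {
    "height": rng.normal(170, 10, size=500),
    "income": rng.lognormal(10, 1, size=500),
}

def interest(col):
    """Distance of a column's empirical distribution from a fitted
    normal; the KS statistic is a simple stand-in for delta."""
    mu, sigma = col.mean(), col.std()
    return stats.kstest(col, "norm", args=(mu, sigma)).statistic

# Rank columns by interest: the skewed column should come first
ranked = sorted(table, key=lambda c: interest(table[c]), reverse=True)
print(ranked)

# Table-level interest with op = average
table_interest = np.mean([interest(c) for c in table.values()])
```

One caveat this surfaces: "distance from normal" rewards any skewed column (IDs, counts, money) regardless of task relevance, so the classifier over the triplets will need the observed picks in $\mathcal{D}$ to learn how much weight interest actually deserves.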

I’m trying to analyze some fairly sparse data on a recurrent medical symptom, and I don’t know what to do with two entries where my data is incomplete.

My overall goal is a bit vague: it’s to find a pattern that hopefully will, with the help of doctors, find a cause. The symptom is not very serious, but annoying. Assume full access to all medical records.

I have data going back three years specifying what day the symptom occurred, and which days it did not. However, for two of the events, I only know that it happened “that month”.

Example:

```
2015,4,1,0,
2015,4,2,0,
2015,4,3,0,
2015,4,4,1,comment
2015,4,5,0,
...
```

(where the columns are year, month, day, a symptom flag (1 if the symptom occurred that day, 0 otherwise), and an optional comment)

My two incomplete entries look like:

```
2015,5,,1,symptom occurred twice this month
2015,5,,1,symptom occurred twice this month
```

Therefore, if I perform an analysis using logistic regression or another method, or even just look at graphs, these two entries are a problem because:

- I know the symptom occurred twice in a certain month;
- I do not know on which days it occurred, so if I guess, randomize the day, or use an average value, I am concerned I will falsify the data.

How should I treat these two missing “day” values knowing that I otherwise have a complete dataset going back three years?
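One defensible option is multiple imputation used as a sensitivity analysis: draw the missing days at random several times, run the analysis on each completed dataset, and report the spread of results rather than a single guess. If the conclusions are stable across completions, the two entries barely matter; if they are not, no single fill-in would have been honest anyway. A sketch, where the record layout mirrors the CSV above and the data is invented:

```python
import random

# Hypothetical daily records: (year, month, day, symptom_flag)
records = [(2015, 4, d, 1 if d == 4 else 0) for d in range(1, 31)]

# Two events known only to have happened "in May 2015"
incomplete = [(2015, 5), (2015, 5)]

def impute(records, incomplete, rng):
    """One completed dataset: assign each month-only event a random
    day in its month, without replacement within the month."""
    out = list(records)
    used = set()
    for year, month in incomplete:
        day = rng.choice([d for d in range(1, 32) if d not in used])
        used.add(day)
        out.append((year, month, day, 1))
    return out

rng = random.Random(0)
# Repeat the downstream analysis on several completions and report
# the range of estimates instead of committing to one guessed day.
completions = [impute(records, incomplete, rng) for _ in range(10)]
print(len(completions), len(completions[0]))
```

For day-of-week or day-of-month pattern hunting specifically, an alternative is to simply exclude the two entries from the daily-resolution analysis and keep them only in monthly aggregates, where they are fully known.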

The Yelp Dataset Challenge (https://www.yelp.com/dataset_challenge) releases data for a handful of cities each year. I’d like to analyze some cities from past years. Is this data archived anywhere?