## #StackBounty: #nlp #pandas #word-embeddings #bert how to run bert's pretrained model word embeddings faster?

### Bounty: 50

I’m trying to get word embeddings for clinical data using microsoft/pubmedbert.
I have 3.6 million text rows. Converting 10k rows to vectors takes around 30 minutes, so the full 3.6 million rows would take around 180 hours (roughly 8 days).

Is there any method where I can speed up the process?

My code –

```python
import re

import pandas as pd
from transformers import AutoTokenizer, pipeline

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('feature-extraction', model=model_name, tokenizer=tokenizer)

def lambda_func(row):
    tokens = tokenizer(row['notetext'])
    if len(tokens['input_ids']) > 512:
        # Crude truncation: split on word boundaries, keep the first 512 pieces
        tokens = re.split(r'\b', row['notetext'])
        tokens = [t for t in tokens if len(t) > 0]
        row['notetext'] = ''.join(tokens[:512])
    row['vectors'] = classifier(row['notetext'])[0][0]
    return row

def process(progress_notes):
    progress_notes = progress_notes.apply(lambda_func, axis=1)
    return progress_notes

progress_notes = process(progress_notes)
vectors_df = pd.DataFrame(progress_notes['vectors'].tolist())
```

My `progress_notes` dataframe looks like this:

```python
progress_notes = pd.DataFrame({
    'id': [1, 2, 3],
    'progressnotetype': ['Nursing Note', 'Nursing Note', 'Administration Note'],
    'notetext': [
        "Patient's skin is grossly intact with exception of skin tear to r inner elbow and r lateral lower leg",
        'Patient with history of Afib with RVR. Patient is incontinent of bowel and bladder.',
        'Give 2 tablet by mouth every 4 hours as needed for Mild to moderate Pain Not to exceed 3 grams in 24 hours',
    ],
})
```

Note: I’m running the code on an AWS EC2 r5.8xlarge instance (32 vCPUs). I tried multiprocessing, but the code deadlocks because BERT already saturates all the CPU cores.
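One common speedup, independent of hardware, is to let the tokenizer truncate to 512 tokens (rather than the regex re-split) and to feed the pipeline batches of texts instead of one row at a time, so each forward pass amortizes Python and dispatch overhead. The sketch below shows only the batching helper; the pipeline call in the comment is hypothetical usage, and the batch size is an assumption to tune:

```python
from typing import Iterator, List

def batched(texts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches of texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

texts = ["note a", "note b", "note c", "note d", "note e"]
all_batches = list(batched(texts, batch_size=2))

# Hypothetical usage with the pipeline from the question:
# vectors = []
# for batch in batched(progress_notes['notetext'].tolist(), 64):
#     vectors.extend(classifier(batch, truncation=True))
```

For the multiprocessing deadlock, a commonly suggested workaround is to limit each worker to one torch thread (`torch.set_num_threads(1)`) so the workers do not oversubscribe the 32 cores.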

Get this bounty!!!

## #StackBounty: #neural-networks #natural-language #word-embeddings Downweight or partially mask certain inputs to Neural Network

### Bounty: 50

I have an NLP sentence-classification task in which the goal is to predict a sentence label that depends on the primary verb used in the sentence. The task can be solved by just memorizing the verb–label association, but I would like to regularize or “encourage” the model to also use information from the surrounding context, so that it generalizes well to unseen verbs. Fully masking the embedding of the verb makes the task underspecified, since some information from the verb is needed to determine the label.

In short, I’d like a way to partially mask or downweight a specific input embedding to a neural network classifier, to encourage the network to use information from the surrounding context in addition to that input. I’ve thought about rescaling the verb embedding by a constant $$c < 1$$, but then $$c$$ becomes a hyperparameter that I would have to set somewhat arbitrarily.
Any suggestions or pointers to references would be greatly appreciated. Thanks!
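For concreteness, the two options in play — rescaling the verb embedding by a constant $$c$$, and stochastically masking it (word dropout, which turns the downweighting into a random event rather than a fixed rescale) — can be sketched as follows. The function name and constants here are hypothetical, not from any particular framework:

```python
import random

def downweight(embedding, c=0.5, dropout_p=0.0, rng=None):
    """Scale one token's embedding by c; with probability dropout_p,
    zero it out entirely (word dropout). Both c and dropout_p are
    hyperparameters chosen by the modeler."""
    rng = rng or random.Random()
    if dropout_p > 0 and rng.random() < dropout_p:
        return [0.0] * len(embedding)
    return [c * x for x in embedding]

verb_vec = [0.2, -0.4, 1.0]
scaled = downweight(verb_vec, c=0.5)                   # deterministic rescale
dropped = downweight(verb_vec, c=1.0, dropout_p=1.0)   # always masked here
```

Word dropout has the advantage that the network still sees the full verb embedding on most steps, so the task stays well specified, while the occasional masking forces it to also learn from context; it replaces the arbitrary choice of $$c$$ with a (perhaps easier to reason about) masking rate.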

Get this bounty!!!

## #StackBounty: #word2vec #word-embeddings #bert BERT vs Word2VEC: Is bert disambiguating the meaning of the word vector?

### Bounty: 50

Word2vec:
Word2vec provides one vector per token/word, and those vectors encode the word’s meaning. Although the vectors are not directly human-interpretable, their meaning can be understood by comparing them with other vectors (for example, the vector of `dog` will be most similar to the vector of `cat`) and through arithmetic relations (for example, `king - man + woman ≈ queen`), which shows how well the vectors capture word semantics.

The problem with word2vec is that each word has only one vector, whereas in the real world a word’s meaning depends on the context and can sometimes be entirely different (for example, `bank` as a financial institution vs. the `bank` of a river).

Bert:
One important difference between BERT/ELMo (dynamic word embeddings) and word2vec is that these models take the context into account and produce a separate vector for each token occurrence.

Now the question is: do vectors from BERT retain the useful properties of word2vec vectors while also solving the disambiguation problem (since these are contextual word embeddings)?

Experiments
To get the vectors from Google’s pre-trained model, I used the bert-embedding 1.0.1 library.
I first checked whether it preserves the similarity property. I took the first paragraphs from the Wikipedia pages for Dog, Cat, and Bank (financial institution). The words most similar to `dog` are:

```
('dog', 1.0)
('wolf', 0.7254540324211121)
('domestic', 0.6261438727378845)
('cat', 0.6036421656608582)
('canis', 0.5722522139549255)
('mammal', 0.5652133226394653)
```

Here, the first element is the token and the second is the similarity.

Now for the disambiguation test:
Along with Dog, Cat, and Bank (financial institution), I added a Wikipedia paragraph about river banks. This is to check whether BERT can differentiate between the two senses of `bank`. The hope is that the vector of the token bank (of a river) will be close to the vectors of `river` or `water` but far from `bank (financial institution)`, `credit`, `financial`, etc. Here is the result (the second element is the sentence, shown for context):

```
('bank', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 1.0)
('bank', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.7796692848205566)
('bank', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.7275459170341492)
('bank', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.7121304273605347)
('bank', 'the bank consists of the sides of the channel , between which the flow is confined .', 0.6965076327323914)
('banks', 'markets to their importance in the financial stability of a country , banks are highly regulated in most countries .', 0.6590269804000854)
('banking', 'most nations have institutionalized a system known as fractional reserve banking under which banks hold liquid assets equal to only a', 0.6490173935890198)
('banks', 'most nations have institutionalized a system known as fractional reserve banking under which banks hold liquid assets equal to only a', 0.6224181652069092)
('financial', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.614281952381134)
('banks', 'stream banks are of particular interest in fluvial geography , which studies the processes associated with rivers and streams and the deposits', 0.6096583604812622)
('structures', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.5771245360374451)
('financial', 'markets to their importance in the financial stability of a country , banks are highly regulated in most countries .', 0.5701562166213989)
('reserve', 'most nations have institutionalized a system known as fractional reserve banking under which banks hold liquid assets equal to only a', 0.5462549328804016)
('institution', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.537483811378479)
('land', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.5331911444664001)
('of', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.527492105960846)
('water', 'in geography , the word bank generally refers to the land alongside a body of water . different structures are referred to as', 0.5234918594360352)
('banks', 'bankfull discharge is a discharge great enough to fill the channel and overtop the banks .', 0.5213838815689087)
('lending', 'lending activities can be performed either directly or indirectly through due capital .', 0.5207482576370239)
('deposits', 'a bank is a financial institution that accepts deposits from the public and creates credit .', 0.5131596922874451)
('stream', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.5108630061149597)
('bankfull', 'bankfull discharge is a discharge great enough to fill the channel and overtop the banks .', 0.5102289915084839)
('river', 'in limnology , a stream bank or river bank is the terrain alongside the bed of a river , creek , or', 0.5099104046821594)
```

These are the vectors most similar to bank (as a river bank; the query token is taken from the context in the first row, which is why its similarity is 1.0, so the second row is the nearest distinct vector). The result shows that the closest token has a very different meaning and context. Even the tokens `river`, `water`, and `stream` have lower similarity.

So it seems that the vectors do not really disambiguate the meaning.
Why is that?
Isn’t the contextual token vector supposed to disambiguate the meaning of a word?
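The comparisons above all reduce to cosine similarity between token vectors, so any alternative layer choice can be evaluated the same way; one commonly reported remedy (a suggestion, not a guarantee) is to average or concatenate several of the last hidden layers rather than using a single layer, since different layers capture context to different degrees. Below is a minimal cosine function applied to toy vectors that are invented for illustration, not real BERT outputs:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only:
bank_river = [0.9, 0.1, 0.2]   # "bank" in a river context
bank_fin   = [0.1, 0.9, 0.3]   # "bank" in a financial context
river      = [0.8, 0.2, 0.1]

sim_same_sense  = cosine(bank_river, river)
sim_cross_sense = cosine(bank_river, bank_fin)
```

If the cross-sense similarity stays high under every layer choice, the problem is more likely in how the bert-embedding library extracts vectors than in BERT itself.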

Get this bounty!!!

## #StackBounty: #nlp #recommender-system #word-embeddings #information-retrieval Building a tag-based recommendation engine given a set o…

### Bounty: 100

Basically, the idea is that users follow tags on the site, so each user has a set of tags they follow. There is also a document collection in which each document has a title, a description, and a set of tags the author judged relevant to the topic of the document. Given this information, what is the best way to recommend documents to a user, taking into account the semantic relevance of a document’s title and description to the user’s tags? Whether that is a word-embeddings solution, a tf-idf solution, or a mix, do tell. I also don’t yet know what to do about tag synonyms; it might have to be a collaborative effort as on Stack Overflow, but if there is a solution or pseudo-solution to this, please share. I’m writing this in C# using the Lucene.NET library, if that is of any relevance.
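As a baseline before bringing in embeddings, the user’s tags can be treated as a query and each document scored by tf-idf over its title and description; this is conceptually what Lucene.NET’s default scoring does for you. The pure-Python sketch below, with invented documents and a simplified smoothed idf, is only to make the idea concrete:

```python
import math
from collections import Counter

# Invented toy documents: title + description concatenated.
docs = {
    "d1": "python pandas dataframe tutorial",
    "d2": "gardening soil compost guide",
}
user_tags = ["pandas", "python"]

def tfidf_scores(docs, query_terms):
    """Score each document by the summed tf-idf of the query terms.
    Uses a simplified smoothed idf; Lucene's formula differs in detail."""
    n = len(docs)
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    df = Counter()
    for tokens in tokenized.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[term] * math.log((1 + n) / (1 + df[term]))
            for term in query_terms
        )
    return scores

scores = tfidf_scores(docs, user_tags)
```

A natural mix is then to retrieve candidates with tf-idf (cheap, handled by Lucene) and re-rank the top results by embedding similarity between the tag set and the title/description, which also softens the tag-synonym problem since synonymous tags tend to have nearby embeddings.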

Get this bounty!!!

## #StackBounty: #nlp #word-embeddings #one-hot-encoding Does Fasttext use One Hot Encoding?

### Bounty: 50

In the original skip-gram/CBOW models, both the context word and the target word are represented as one-hot encodings.

Does fastText also use a one-hot encoding for each subword when training the skip-gram/CBOW model (so that the length of the one-hot vector is |Vocab| + |all subwords|)? If so, is it used for both the context and the target words?
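To make the subword question concrete: fastText builds, for each word, the set of character n-grams of the word padded with boundary markers `<` and `>`, plus the padded word itself, and (in the reference implementation) hashes each n-gram into a fixed-size embedding table rather than keeping a literal |Vocab| + |subwords| one-hot vector; conceptually, though, the input side behaves like an indicator vector over exactly this set. The sketch below only enumerates the subwords and omits the hashing step:

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set:
    """Enumerate fastText-style subwords: all character n-grams of the
    word padded with boundary markers, plus the padded word itself."""
    padded = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    grams.add(padded)
    return grams

# The paper's running example: "where" with n = 3.
subwords = char_ngrams("where", n_min=3, n_max=3)
```

As far as the original fastText paper describes, subwords enter only the input (target-word) representation, which is the sum of the word vector and its subword vectors; context words are scored against plain output word vectors, as in ordinary skip-gram.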

Get this bounty!!!
